1. 15 5月, 2020 1 次提交
    • X
      alinux: add tcprt framework to kernel · f84e8fa0
      xuanzhuo 提交于
      to #26353046
      
      TcpRT: Instrument and Diagnostic Analysis System for Service Quality
      of Cloud Databases at Massive Scale in Real-time.
      
      It can also provide information for all request/response services. Such as
      HTTP request.
      
      This is the kernel framework for tcprt, more work needs tcprt module
      support.
      
      TcpRt module should call tcp_unregitsert_rt before rmmod.
      
      TcpRt hooks will be called when sock init, recv data, send data,
      packet acked and socket been destroy. The private data save to
      icsk->icsk_tcp_rt_priv.
      Reviewed-by: NCambda Zhu <cambda@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      Signed-off-by: Nxuanzhuo <xuanzhuo@linux.alibaba.com>
      f84e8fa0
  2. 25 7月, 2018 1 次提交
  3. 02 3月, 2018 1 次提交
  4. 13 10月, 2017 1 次提交
  5. 15 2月, 2017 1 次提交
    • S
      esp: Add a software GRO codepath · 7785bba2
      Steffen Klassert 提交于
      This patch adds GRO ifrastructure and callbacks for ESP on
      ipv4 and ipv6.
      
      In case the GRO layer detects an ESP packet, the
      esp{4,6}_gro_receive() function does a xfrm state lookup
      and calls the xfrm input layer if it finds a matching state.
      The packet will be decapsulated and reinjected it into layer 2.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      7785bba2
  6. 09 2月, 2017 1 次提交
  7. 29 11月, 2016 1 次提交
  8. 24 10月, 2016 1 次提交
    • C
      net: ip, diag -- Add diag interface for raw sockets · 432490f9
      Cyrill Gorcunov 提交于
      In criu we are actively using diag interface to collect sockets
      present in the system when dumping applications. And while for
      unix, tcp, udp[lite], packet, netlink it works as expected,
      the raw sockets do not have. Thus add it.
      
      v2:
       - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
       - implement @destroy for diag requests (by dsa@)
      
      v3:
       - add export of raw_abort for IPv6 (by dsa@)
       - pass net-admin flag into inet_sk_diag_fill due to
         changes in net-next branch (by dsa@)
      
      v4:
       - use @pad in struct inet_diag_req_v2 for raw socket
         protocol specification: raw module carries sockets
         which may have custom protocol passed from socket()
         syscall and sole @sdiag_protocol is not enough to
         match underlied ones
       - start reporting protocol specifed in socket() call
         when sockets are raw ones for the same reason: user
         space tools like ss may parse this attribute and use
         it for socket matching
      
      v5 (by eric.dumazet@):
       - use sock_hold in raw_sock_get instead of atomic_inc,
         we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
         when looking up so counter won't be zero here.
      
      v6:
       - use sdiag_raw_protocol() helper which will access @pad
         structure used for raw sockets protocol specification:
         we can't simply rename this member without breaking uapi
      
      v7:
       - sine sdiag_raw_protocol() helper is not suitable for
         uapi lets rather make an alias structure with proper
         names. __check_inet_diag_req_raw helper will catch
         if any of structure unintentionally changed.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      432490f9
  9. 21 9月, 2016 1 次提交
    • N
      tcp_bbr: add BBR congestion control · 0f8782ea
      Neal Cardwell 提交于
      This commit implements a new TCP congestion control algorithm: BBR
      (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
      published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
      "BBR: Congestion-Based Congestion Control".
      
      BBR has significantly increased throughput and reduced latency for
      connections on Google's internal backbone networks and google.com and
      YouTube Web servers.
      
      BBR requires only changes on the sender side, not in the network or
      the receiver side. Thus it can be incrementally deployed on today's
      Internet, or in datacenters.
      
      The Internet has predominantly used loss-based congestion control
      (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
      signal to slow down. While this worked well for many years, loss-based
      congestion control is unfortunately out-dated in today's networks. On
      today's Internet, loss-based congestion control causes the infamous
      bufferbloat problem, often causing seconds of needless queuing delay,
      since it fills the bloated buffers in many last-mile links. On today's
      high-speed long-haul links using commodity switches with shallow
      buffers, loss-based congestion control has abysmal throughput because
      it over-reacts to losses caused by transient traffic bursts.
      
      In 1981 Kleinrock and Gale showed that the optimal operating point for
      a network maximizes delivered bandwidth while minimizing delay and
      loss, not only for single connections but for the network as a
      whole. Finding that optimal operating point has been elusive, since
      any single network measurement is ambiguous: network measurements are
      the result of both bandwidth and propagation delay, and those two
      cannot be measured simultaneously.
      
      While it is impossible to disambiguate any single bandwidth or RTT
      measurement, a connection's behavior over time tells a clearer
      story. BBR uses a measurement strategy designed to resolve this
      ambiguity. It combines these measurements with a robust servo loop
      using recent control systems advances to implement a distributed
      congestion control algorithm that reacts to actual congestion, not
      packet loss or transient queue delay, and is designed to converge with
      high probability to a point near the optimal operating point.
      
      In a nutshell, BBR creates an explicit model of the network pipe by
      sequentially probing the bottleneck bandwidth and RTT. On the arrival
      of each ACK, BBR derives the current delivery rate of the last round
      trip, and feeds it through a windowed max-filter to estimate the
      bottleneck bandwidth. Conversely it uses a windowed min-filter to
      estimate the round trip propagation delay. The max-filtered bandwidth
      and min-filtered RTT estimates form BBR's model of the network pipe.
      
      Using its model, BBR sets control parameters to govern sending
      behavior. The primary control is the pacing rate: BBR applies a gain
      multiplier to transmit faster or slower than the observed bottleneck
      bandwidth. The conventional congestion window (cwnd) is now the
      secondary control; the cwnd is set to a small multiple of the
      estimated BDP (bandwidth-delay product) in order to allow full
      utilization and bandwidth probing while bounding the potential amount
      of queue at the bottleneck.
      
      When a BBR connection starts, it enters STARTUP mode and applies a
      high gain to perform an exponential search to quickly probe the
      bottleneck bandwidth (doubling its sending rate each round trip, like
      slow start). However, instead of continuing until it fills up the
      buffer (i.e. a loss), or until delay or ACK spacing reaches some
      threshold (like Hystart), it uses its model of the pipe to estimate
      when that pipe is full: it estimates the pipe is full when it notices
      the estimated bandwidth has stopped growing. At that point it exits
      STARTUP and enters DRAIN mode, where it reduces its pacing rate to
      drain the queue it estimates it has created.
      
      Then BBR enters steady state. In steady state, PROBE_BW mode cycles
      between first pacing faster to probe for more bandwidth, then pacing
      slower to drain any queue that created if no more bandwidth was
      available, and then cruising at the estimated bandwidth to utilize the
      pipe without creating excess queue. Occasionally, on an as-needed
      basis, it sends significantly slower to probe for RTT (PROBE_RTT
      mode).
      
      BBR has been fully deployed on Google's wide-area backbone networks
      and we're experimenting with BBR on Google.com and YouTube on a global
      scale.  Replacing CUBIC with BBR has resulted in significant
      improvements in network latency and application (RPC, browser, and
      video) metrics. For more details please refer to our upcoming ACM
      Queue publication.
      
      Example performance results, to illustrate the difference between BBR
      and CUBIC:
      
      Resilience to random loss (e.g. from shallow buffers):
        Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
        path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
        rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
      
      Low latency with the bloated buffers common in today's last-mile links:
        Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
        path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
        buffer. Both fully utilize the bottleneck bandwidth, but BBR
        achieves this with a median RTT 25x lower (43 ms instead of 1.09
        secs).
      
      Our long-term goal is to improve the congestion control algorithms
      used on the Internet. We are hopeful that BBR can help advance the
      efforts toward this goal, and motivate the community to do further
      research.
      
      Test results, performance evaluations, feedback, and BBR-related
      discussions are very welcome in the public e-mail list for BBR:
      
        https://groups.google.com/forum/#!forum/bbr-dev
      
      NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
      enabled, since pacing is integral to the BBR design and
      implementation. BBR without pacing would not function properly, and
      may incur unnecessary high packet loss rates.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f8782ea
  10. 11 6月, 2016 1 次提交
  11. 18 2月, 2016 1 次提交
  12. 17 2月, 2016 1 次提交
  13. 26 1月, 2016 1 次提交
  14. 16 12月, 2015 1 次提交
  15. 28 8月, 2015 1 次提交
  16. 11 6月, 2015 1 次提交
    • K
      tcp: add CDG congestion control · 2b0a8c9e
      Kenneth Klette Jonassen 提交于
      CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
      the TCP sender in order to [1]:
      
        o Use the delay gradient as a congestion signal.
        o Back off with an average probability that is independent of the RTT.
        o Coexist with flows that use loss-based congestion control, i.e.,
          flows that are unresponsive to the delay signal.
        o Tolerate packet loss unrelated to congestion. (Disabled by default.)
      
      Its FreeBSD implementation was presented for the ICCRG in July 2012;
      slides are available at http://www.ietf.org/proceedings/84/iccrg.html
      
      Running the experiment scenarios in [1] suggests that our implementation
      achieves more goodput compared with FreeBSD 10.0 senders, although it also
      causes more queueing delay for a given backoff factor.
      
      The loss tolerance heuristic is disabled by default due to safety concerns
      for its use in the Internet [2, p. 45-46].
      
      We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
      the probability of slow start overshoot.
      
      [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
          delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
      [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
          MSc thesis. Department of Informatics, University of Oslo, 2015.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: David Hayes <davihay@ifi.uio.no>
      Cc: Andreas Petlund <apetlund@simula.no>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
      Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b0a8c9e
  17. 14 5月, 2015 1 次提交
  18. 06 11月, 2014 1 次提交
    • T
      net: Move fou_build_header into fou.c and refactor · 63487bab
      Tom Herbert 提交于
      Move fou_build_header out of ip_tunnel.c and into fou.c splitting
      it up into fou_build_header, gue_build_header, and fou_build_udp.
      This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
      to call fou_build_header or gue_build_header based on the tunnel
      encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
      functions which are called by ip_encap_hlen. New net/fou.h has
      prototypes and defines for this.
      
      Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
      can use FOU/GUE and fou module is also selected.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63487bab
  19. 07 10月, 2014 1 次提交
    • A
      openvswitch: fix a compilation error when CONFIG_INET is not setW! · 7c5df8fa
      Andy Zhou 提交于
      Fix a openvswitch compilation error when CONFIG_INET is not set:
      
      =====================================================
         In file included from include/net/geneve.h:4:0,
                             from net/openvswitch/flow_netlink.c:45:
      		          include/net/udp_tunnel.h: In function 'udp_tunnel_handle_offloads':
      			  >> include/net/udp_tunnel.h:100:2: error: implicit declaration of function 'iptunnel_handle_offloads' [-Werror=implicit-function-declaration]
      			  >>      return iptunnel_handle_offloads(skb, udp_csum, type);
      			  >>           ^
      			  >>           >> include/net/udp_tunnel.h:100:2: warning: return makes pointer from integer without a cast
      			  >>           >>    cc1: some warnings being treated as errors
      
      =====================================================
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c5df8fa
  20. 06 10月, 2014 1 次提交
  21. 29 9月, 2014 1 次提交
    • D
      net: tcp: add DCTCP congestion control algorithm · e3118e83
      Daniel Borkmann 提交于
      This work adds the DataCenter TCP (DCTCP) congestion control
      algorithm [1], which has been first published at SIGCOMM 2010 [2],
      resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
      recently as an informational IETF draft available at [4]).
      
      DCTCP is an enhancement to the TCP congestion control algorithm for
      data center networks. Typical data center workloads are i.e.
      i) partition/aggregate (queries; bursty, delay sensitive), ii) short
      messages e.g. 50KB-1MB (for coordination and control state; delay
      sensitive), and iii) large flows e.g. 1MB-100MB (data update;
      throughput sensitive). DCTCP has therefore been designed for such
      environments to provide/achieve the following three requirements:
      
        * High burst tolerance (incast due to partition/aggregate)
        * Low latency (short flows, queries)
        * High throughput (continuous data updates, large file
          transfers) with commodity, shallow buffered switches
      
      The basic idea of its design consists of two fundamentals: i) on the
      switch side, packets are being marked when its internal queue
      length > threshold K (K is chosen so that a large enough headroom
      for marked traffic is still available in the switch queue); ii) the
      sender/host side maintains a moving average of the fraction of marked
      packets, so each RTT, F is being updated as follows:
      
       F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
       alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
      
      The resulting alpha (iow: probability that switch queue is congested)
      is then being used in order to adaptively decrease the congestion
      window W:
      
       W := (1 - (alpha / 2)) * W
      
      The means for receiving marked packets resp. marking them on switch
      side in DCTCP is the use of ECN.
      
      RFC3168 describes a mechanism for using Explicit Congestion Notification
      from the switch for early detection of congestion, rather than waiting
      for segment loss to occur.
      
      However, this method only detects the presence of congestion, not
      the *extent*. In the presence of mild congestion, it reduces the TCP
      congestion window too aggressively and unnecessarily affects the
      throughput of long flows [4].
      
      DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
      processing to estimate the fraction of bytes that encounter congestion,
      rather than simply detecting that some congestion has occurred. DCTCP
      then scales the TCP congestion window based on this estimate [4],
      thus it can derive multibit feedback from the information present in
      the single-bit sequence of marks in its control law. And thus act in
      *proportion* to the extent of congestion, not its *presence*.
      
      Switches therefore set the Congestion Experienced (CE) codepoint in
      packets when internal queue lengths exceed threshold K. Resulting,
      DCTCP delivers the same or better throughput than normal TCP, while
      using 90% less buffer space.
      
      It was found in [2] that DCTCP enables the applications to handle 10x
      the current background traffic, without impacting foreground traffic.
      Moreover, a 10x increase in foreground traffic did not cause any
      timeouts, and thus largely eliminates TCP incast collapse problems.
      
      The algorithm itself has already seen deployments in large production
      data centers since then.
      
      We did a long-term stress-test and analysis in a data center, short
      summary of our TCP incast tests with iperf compared to cubic:
      
      This test measured DCTCP throughput and latency and compared it with
      CUBIC throughput and latency for an incast scenario. In this test, 19
      senders sent at maximum rate to a single receiver. The receiver simply
      ran iperf -s.
      
      The senders ran iperf -c <receiver> -t 30. All senders started
      simultaneously (using local clocks synchronized by ntp).
      
      This test was repeated multiple times. Below shows the results from a
      single test. Other tests are similar. (DCTCP results were extremely
      consistent, CUBIC results show some variance induced by the TCP timeouts
      that CUBIC encountered.)
      
      For this test, we report statistics on the number of TCP timeouts,
      flow throughput, and traffic latency.
      
      1) Timeouts (total over all flows, and per flow summaries):
      
                  CUBIC            DCTCP
        Total     3227             25
        Mean       169.842          1.316
        Median     183              1
        Max        207              5
        Min        123              0
        Stddev      28.991          1.600
      
      Timeout data is taken by measuring the net change in netstat -s
      "other TCP timeouts" reported. As a result, the timeout measurements
      above are not restricted to the test traffic, and we believe that it
      is likely that all of the "DCTCP timeouts" are actually timeouts for
      non-test traffic. We report them nevertheless. CUBIC will also include
      some non-test timeouts, but they are drawfed by bona fide test traffic
      timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
      TCP timeouts. DCTCP reduces timeouts by at least two orders of
      magnitude and may well have eliminated them in this scenario.
      
      2) Throughput (per flow in Mbps):
      
                  CUBIC            DCTCP
        Mean      521.684          521.895
        Median    464              523
        Max       776              527
        Min       403              519
        Stddev    105.891            2.601
        Fairness    0.962            0.999
      
      Throughput data was simply the average throughput for each flow
      reported by iperf. By avoiding TCP timeouts, DCTCP is able to
      achieve much better per-flow results. In CUBIC, many flows
      experience TCP timeouts which makes flow throughput unpredictable and
      unfair. DCTCP, on the other hand, provides very clean predictable
      throughput without incurring TCP timeouts. Thus, the standard deviation
      of CUBIC throughput is dramatically higher than the standard deviation
      of DCTCP throughput.
      
      Mean throughput is nearly identical because even though cubic flows
      suffer TCP timeouts, other flows will step in and fill the unused
      bandwidth. Note that this test is something of a best case scenario
      for incast under CUBIC: it allows other flows to fill in for flows
      experiencing a timeout. Under situations where the receiver is issuing
      requests and then waiting for all flows to complete, flows cannot fill
      in for timed out flows and throughput will drop dramatically.
      
      3) Latency (in ms):
      
                  CUBIC            DCTCP
        Mean      4.0088           0.04219
        Median    4.055            0.0395
        Max       4.2              0.085
        Min       3.32             0.028
        Stddev    0.1666           0.01064
      
      Latency for each protocol was computed by running "ping -i 0.2
      <receiver>" from a single sender to the receiver during the incast
      test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
      that traffic traversed the DCTCP queue and was not dropped when the
      queue size was greater than the marking threshold. The summary
      statistics above are over all ping metrics measured between the single
      sender, receiver pair.
      
      The latency results for this test show a dramatic difference between
      CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
      which incurs the maximum queue latency (more buffer memory will lead
      to high latency.) DCTCP, on the other hand, deliberately attempts to
      keep queue occupancy low. The result is a two orders of magnitude
      reduction of latency with DCTCP - even with a switch with relatively
      little RAM. Switches with larger amounts of RAM will incur increasing
      amounts of latency for CUBIC, but not for DCTCP.
      
      4) Convergence and stability test:
      
      This test measured the time that DCTCP took to fairly redistribute
      bandwidth when a new flow commences. It also measured DCTCP's ability
      to remain stable at a fair bandwidth distribution. DCTCP is compared
      with CUBIC for this test.
      
      At the commencement of this test, a single flow is sending at maximum
      rate (near 10 Gbps) to a single receiver. One second after that first
      flow commences, a new flow from a distinct server begins sending to
      the same receiver as the first flow. After the second flow has sent
      data for 10 seconds, the second flow is terminated. The first flow
      sends for an additional second. Ideally, the bandwidth would be evenly
      shared as soon as the second flow starts, and recover as soon as it
      stops.
      
      The results of this test are shown below. Note that the flow bandwidth
      for the two flows was measured near the same time, but not
      simultaneously.
      
      DCTCP performs nearly perfectly within the measurement limitations
      of this test: bandwidth is quickly distributed fairly between the two
      flows, remains stable throughout the duration of the test, and
      recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
      fairly, and has trouble remaining stable.
      
        CUBIC                      DCTCP
      
        Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
         0       9.93    0          0       9.92    0
         0.5     9.87    0          0.5     9.86    0
         1       8.73    2.25       1       6.46    4.88
         1.5     7.29    2.8        1.5     4.9     4.99
         2       6.96    3.1        2       4.92    4.94
         2.5     6.67    3.34       2.5     4.93    5
         3       6.39    3.57       3       4.92    4.99
         3.5     6.24    3.75       3.5     4.94    4.74
         4       6       3.94       4       5.34    4.71
         4.5     5.88    4.09       4.5     4.99    4.97
         5       5.27    4.98       5       4.83    5.01
         5.5     4.93    5.04       5.5     4.89    4.99
         6       4.9     4.99       6       4.92    5.04
         6.5     4.93    5.1        6.5     4.91    4.97
         7       4.28    5.8        7       4.97    4.97
         7.5     4.62    4.91       7.5     4.99    4.82
         8       5.05    4.45       8       5.16    4.76
         8.5     5.93    4.09       8.5     4.94    4.98
         9       5.73    4.2        9       4.92    5.02
         9.5     5.62    4.32       9.5     4.87    5.03
        10       6.12    3.2       10       4.91    5.01
        10.5     6.91    3.11      10.5     4.87    5.04
        11       8.48    0         11       8.49    4.94
        11.5     9.87    0         11.5     9.9     0
      
      SYN/ACK ECT test:
      
      This test demonstrates the importance of ECT on SYN and SYN-ACK packets
      by measuring the connection probability in the presence of competing
      flows for a DCTCP connection attempt *without* ECT in the SYN packet.
      The test was repeated five times for each number of competing flows.
      
                    Competing Flows  1 |    2 |    4 |    8 |   16
                                     ------------------------------
      Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
      Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0
      
      As the number of competing flows moves beyond 1, the connection
      probability drops rapidly.
      
      Enabling DCTCP with this patch requires the following steps:
      
      DCTCP must be running both on the sender and receiver side in your
      data center, i.e.:
      
        sysctl -w net.ipv4.tcp_congestion_control=dctcp
      
      Also, ECN functionality must be enabled on all switches in your
      data center for DCTCP to work. The default ECN marking threshold (K)
      heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
      1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
      
      In above tests, for each switch port, traffic was segregated into two
      queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
      0x04 - the packet was placed into the DCTCP queue. All other packets
      were placed into the default drop-tail queue. For the DCTCP queue,
      RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
      More details however, we refer you to the paper [2] under section 3).
      
      There are no code changes required to applications running in user
      space. DCTCP has been implemented in full *isolation* of the rest of
      the TCP code as its own congestion control module, so that it can run
      without a need to expose code to the core of the TCP stack, and thus
      nothing changes for non-DCTCP users.
      
      Changes in the CA framework code are minimal, and DCTCP algorithm
      operates on mechanisms that are already available in most Silicon.
      The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
      the paper, but we leave the option that it can be chosen carefully
      to a different value by the user.
      
      In case DCTCP is being used and ECN support on peer site is off,
      DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
      
      ss {-4,-6} -t -i diag interface:
      
        ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
        ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
        send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
        reordering:101 rcv_space:29200
      
        ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
        cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
        325.5Mbps rcv_rtt:1.5 rcv_space:29200
      
      More information about DCTCP can be found in [1-4].
      
        [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
        [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
        [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
        [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3118e83
  22. 20 9月, 2014 1 次提交
    • T
      fou: Support for foo-over-udp RX path · 23461551
      Tom Herbert 提交于
      This patch provides a receive path for foo-over-udp. This allows
      direct encapsulation of IP protocols over UDP. The bound destination
      port is used to map to an IP protocol, and the XFRM framework
      (udp_encap_rcv) is used to receive encapsulated packets. Upon
      reception, the encapsulation header is logically removed (pointer
      to transport header is advanced) and the packet is reinjected into
      the receive path with the IP protocol indicated by the mapping.
      
      Netlink is used to configure FOU ports. The configuration information
      includes the port number to bind to and the IP protocol corresponding
      to that port.
      
      This should support GRE/UDP
      (http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
      as will as the other IP tunneling protocols (IPIP, SIT).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23461551
  23. 15 7月, 2014 1 次提交
  24. 04 9月, 2013 1 次提交
    • T
      net: neighbour: Remove CONFIG_ARPD · 3e25c65e
      Tim Gardner 提交于
      This config option is superfluous in that it only guards a call
      to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
      change in behavior. There will now be call to __neigh_notify()
      for each ARP resolution, which has no impact unless there is a
      user space daemon waiting to receive the notification, i.e.,
      the case for which CONFIG_ARPD was designed anyways.
      Suggested-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e25c65e
  25. 05 6月, 2013 1 次提交
  26. 27 3月, 2013 3 次提交
  27. 12 1月, 2013 1 次提交
    • K
      net/ipv4: remove depends on CONFIG_EXPERIMENTAL · 44fbe920
      Kees Cook 提交于
      The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
      while now and is almost always enabled by default. As agreed during the
      Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      44fbe920
  28. 19 7月, 2012 1 次提交
  29. 16 5月, 2012 2 次提交
  30. 08 2月, 2012 1 次提交
  31. 08 1月, 2012 1 次提交
  32. 11 12月, 2011 1 次提交
  33. 10 12月, 2011 1 次提交
  34. 31 10月, 2011 1 次提交
    • P
      Kconfig: remove a few puzzling comments · bfc994b5
      Paul Bolle 提交于
      These comments mention CONFIG options that do not exist: not as a symbol
      in a Kconfig file (without the CONFIG_ prefix) and neither as a symbol
      (with that prefix) in the code.
      
      There's one reference to XSCALE_PMU_TIMER as a negative dependency.
      But XSCALE_PMU_TIMER is never defined (CONFIG_XSCALE_PMU_TIMER is
      also unused in the code). It shows up with type "unknown" if you search
      for it in menuconfig. Apparently a negative dependency on an unknown
      symbol is always true. That negative dependency can be removed too.
      Signed-off-by: NPaul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      bfc994b5
  35. 02 2月, 2011 1 次提交
    • D
      ipv4: Remove fib_hash. · 3630b7c0
      David S. Miller 提交于
      The time has finally come to remove the hash based routing table
      implementation in ipv4.
      
      FIB Trie is mature, well tested, and I've done an audit of it's code
      to confirm that it implements insert, delete, and lookup with the same
      identical semantics as fib_hash did.
      
      If there are any semantic differences found in fib_trie, we should
      simply fix them.
      
      I've placed the trie statistic config option under advanced router
      configuration.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NStephen Hemminger <shemminger@vyatta.com>
      3630b7c0
  36. 14 1月, 2011 1 次提交
    • P
      netfilter: fix Kconfig dependencies · c7066f70
      Patrick McHardy 提交于
      Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
      which itself depends on NET_SCHED; this dependency is missing from netfilter.
      
      Since matching on realms is also useful without having NET_SCHED enabled and
      the option really only controls whether the tclassid member is included in
      route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
      it outside of traffic scheduling context to get rid of the NET_SCHED dependeny.
      Reported-by: NVladis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      c7066f70
  37. 16 11月, 2010 1 次提交
    • M
      Docs/Kconfig: Update: osdl.org -> linuxfoundation.org · c996d8b9
      Michael Witten 提交于
      Some of the documentation refers to web pages under
      the domain `osdl.org'. However, `osdl.org' now
      redirects to `linuxfoundation.org'.
      
      Rather than rely on redirections, this patch updates
      the addresses appropriately; for the most part, only
      documentation that is meant to be current has been
      updated.
      
      The patch should be pretty quick to scan and check;
      each new web-page url was gotten by trying out the
      original URL in a browser and then simply copying the
      the redirected URL (formatting as necessary).
      
      There is some conflict as to which one of these domain
      names is preferred:
      
        linuxfoundation.org
        linux-foundation.org
      
      So, I wrote:
      
        info@linuxfoundation.org
      
      and got this reply:
      
        Message-ID: <4CE17EE6.9040807@linuxfoundation.org>
        Date: Mon, 15 Nov 2010 10:41:42 -0800
        From: David Ames <david@linuxfoundation.org>
      
        ...
      
        linuxfoundation.org is preferred. The canonical name for our web site is
        www.linuxfoundation.org. Our list site is actually
        lists.linux-foundation.org.
      
        Regarding email linuxfoundation.org is preferred there are a few people
        who choose to use linux-foundation.org for their own reasons.
      
      Consequently, I used `linuxfoundation.org' for web pages and
      `lists.linux-foundation.org' for mailing-list web pages and email addresses;
      the only personal email address I updated from `@osdl.org' was that of
      Andrew Morton, who prefers `linux-foundation.org' according `git log'.
      Signed-off-by: NMichael Witten <mfwitten@gmail.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      c996d8b9