- 09 3月, 2015 1 次提交
-
-
由 stephen hemminger 提交于
Remove reference to obsolete ifconfig command. MTU can be changed with ip command instead. Signed-off-by: NStephen Hemminger <stephen@networkplumber.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 3月, 2015 1 次提交
-
-
由 Fan Du 提交于
Namely tcp_probe_interval to control how often to restart a probe. And tcp_probe_threshold to control when stop the probing in respect to the width of search range in bytes Signed-off-by: NFan Du <fan.du@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 3月, 2015 1 次提交
-
-
由 Eric W. Biederman 提交于
This sysctl gives two benefits. By defaulting the table size to 0 mpls even when compiled in and enabled defaults to not forwarding any packets. This prevents unpleasant surprises for users. The other benefit is that as mpls labels are allocated locally a dense table a small dense label table may be used which saves memory and is extremely simple and efficient to implement. This sysctl allows userspace to choose the restrictions on the label table size userspace applications need to cope with. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 2月, 2015 1 次提交
-
-
由 Marcelo Ricardo Leitner 提交于
Currently, when TCP/SCTP port reusing happens, IPVS will find the old entry and use it for the new one, behaving like a forced persistence. But if you consider a cluster with a heavy load of small connections, such reuse will happen often and may lead to a not optimal load balancing and might prevent a new node from getting a fair load. This patch introduces a new sysctl, conn_reuse_mode, that allows controlling how to proceed when port reuse is detected. The default value will allow rescheduling of new connections only if the old entry was in TIME_WAIT state for TCP or CLOSED for SCTP. Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: NJulian Anastasov <ja@ssi.bg> Signed-off-by: NSimon Horman <horms@verge.net.au>
-
- 24 2月, 2015 4 次提交
-
-
由 Ben Hutchings 提交于
Drop the '.o' suffix so this text properly covers both the built-in and modular cases. 'insmod pktgen' obviously won't work; the command should be modprobe. Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ben Hutchings 提交于
These are Robert Olsson's samples which used to be available from <ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/> but currently are not. Change the documentation to refer to these consistently as 'sample scripts', matching the directory name used here. Cc: Robert Olsson <robert@herjulf.se> Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ben Hutchings 提交于
Thanks to Rob Jones for suggesting some of the changes. Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ben Hutchings 提交于
This has been updated quite a few times since 2004, and git can keep track of the actual date for us. Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 2月, 2015 1 次提交
-
-
由 Neal Cardwell 提交于
Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to thus elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing analogous knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer. Reported-by: NAvery Fay <avery@mixpanel.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 2月, 2015 1 次提交
-
-
由 Stefan Tatschner 提交于
<linux/can/error.h> moved in the big UAPI shuffle; update the document to note its new location. Signed-off-by: NStefan Tatschner <stefan@sevenbyte.org> [jc: added changelog] Signed-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 03 2月, 2015 3 次提交
-
-
由 Richard Weinberger 提交于
Update netlink_mmap.txt wrt. commit 4682a035 ("netlink: Always copy on mmap TX."). Signed-off-by: NRichard Weinberger <richard@nod.at> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
Demonstrate how SOF_TIMESTAMPING_OPT_TSONLY can be used and test the implementation. Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit timestamps, this loops timestamps on top of empty packets. Doing so reduces the pressure on SO_RCVBUF. Payload inspection and cmsg reception (aside from timestamps) are no longer possible. This works together with a follow on patch that allows administrators to only allow tx timestamping if it does not loop payload or metadata. Signed-off-by: NWillem de Bruijn <willemb@google.com> ---- Changes (rfc -> v1) - add documentation - remove unnecessary skb->len test (thanks to Richard Cochran) Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 1月, 2015 1 次提交
-
-
由 Joe Stringer 提交于
Previously, flows were manipulated by userspace specifying a full, unmasked flow key. This adds significant burden onto flow serialization/deserialization, particularly when dumping flows. This patch adds an alternative way to refer to flows using a variable-length "unique flow identifier" (UFID). At flow setup time, userspace may specify a UFID for a flow, which is stored with the flow and inserted into a separate table for lookup, in addition to the standard flow table. Flows created using a UFID must be fetched or deleted using the UFID. All flow dump operations may now be made more terse with OVS_UFID_F_* flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to omit the flow key from a datapath operation if the flow has a corresponding UFID. This significantly reduces the time spent assembling and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags enabled, the datapath only returns the UFID and statistics for each flow during flow dump, increasing ovs-vswitchd revalidator performance by 40% or more. Signed-off-by: NJoe Stringer <joestringer@nicira.com> Acked-by: NPravin B Shelar <pshelar@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 1月, 2015 1 次提交
-
-
由 Harout Hedeshian 提交于
The kernel forcefully applies MTU values received in router advertisements provided the new MTU is less than the current. This behavior is undesirable when the user space is managing the MTU. Instead a sysctl flag 'accept_ra_mtu' is introduced such that the user space can control whether or not RA provided MTU updates should be applied. The default behavior is unchanged; user space must explicitly set this flag to 0 for RA MTUs to be ignored. Signed-off-by: NHarout Hedeshian <harouth@codeaurora.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 1月, 2015 1 次提交
-
-
由 Jiri Pirko 提交于
The same macros are used for rx as well. So rename it. Signed-off-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 1月, 2015 1 次提交
-
-
由 Ani Sinha 提交于
Update documentation to reflect the fact that /proc/sys/net/ipv4/route/max_size is no longer used for ipv4. Signed-off-by: NAni Sinha <ani@arista.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 1月, 2015 1 次提交
-
-
由 Willem de Bruijn 提交于
A fix to ipv6 structure definitions removed the now superfluous definition of in6_pktinfo in this file. But, use of the glibc definition requires defining _GNU_SOURCE (see also https://sourceware.org/bugzilla/show_bug.cgi?id=6775). Before this change, the following would fail for me: make make headers_install make M=Documentation/networking/timestamping with Documentation/networking/timestamping/txtimestamp.c: In function '__recv_errmsg_cmsg': Documentation/networking/timestamping/txtimestamp.c:205:33: error: dereferencing pointer to incomplete type Documentation/networking/timestamping/txtimestamp.c:206:23: error: dereferencing pointer to incomplete type After this patch compilation succeeded. Fixes: cd91cc5b ("doc: fix the compile error of txtimestamp.c") Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 1月, 2015 1 次提交
-
-
由 WANG Cong 提交于
Vinson reported: HOSTCC Documentation/networking/timestamping/txtimestamp Documentation/networking/timestamping/txtimestamp.c:64:8: error: redefinition of ‘struct in6_pktinfo’ struct in6_pktinfo { ^ In file included from /usr/include/arpa/inet.h:23:0, from Documentation/networking/timestamping/txtimestamp.c:33: /usr/include/netinet/in.h:456:8: note: originally defined here struct in6_pktinfo ^ After we sync with libc header, we don't need this ugly hack any more. Reported-by: NVinson Lee <vlee@twopensource.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 12月, 2014 1 次提交
-
-
由 Henrik Austad 提交于
- altera_tse.txt was added by 04add4ab (Add Altera Ethernet (TSE) Documentation) - cdc_mbim.txt was added by a563babe (cdc_mbim: add driver documentation) - dctcp.txt was added by e3118e83 (tcp: add DCTCP congestion control algorithm) CC: Jonathan Corbet <corbet@lwn.net> CC: "David S. Miller" <davem@davemloft.net> CC: linux-doc@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: NHenrik Austad <henrik@austad.us> Signed-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 23 12月, 2014 1 次提交
-
-
由 Marcelo Leitner 提交于
Manually bumping either nf_conntrack_buckets or nf_conntrack_max has become a common task as our Linux servers tend to serve more and more clients/applications, so let's adjust nf_conntrack_buckets this to a more updated value. Now for systems with more than 4GB of memory, nf_conntrack_buckets becomes 65536 instead of 16384, resulting in nf_conntrack_max=256k entries. Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 16 12月, 2014 1 次提交
-
-
由 Duan Jiong 提交于
Fix the typo, there should be "It". On the other hand, fix whitespace errors detected by checkpatch.pl Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 12月, 2014 1 次提交
-
-
由 Rami Rosen 提交于
This patch fixes the erronous usage of an hexadecimal address in the example, by replacing it with a decimal address. Signed-off-by: NRami Rosen <ramirose@gmail.com> Acked-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 12月, 2014 2 次提交
-
-
由 Willem de Bruijn 提交于
Documentation: expand explanation of timestamp counter Test: new: flag -I requests and prints PKTINFO new: flag -x prints payload (possibly truncated) fix: remove pretty print that breaks common flag '-l 1' Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
Allow reading of timestamps and cmsg at the same time on all relevant socket families. One use is to correlate timestamps with egress device, by asking for cmsg IP_PKTINFO. on AF_INET sockets, call the relevant function (ip_cmsg_recv). To avoid changing legacy expectations, only do so if the caller sets a new timestamping flag SOF_TIMESTAMPING_OPT_CMSG. on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already returned for all origins. only change is to set ifindex, which is not initialized for all error origins. In both cases, only generate the pktinfo message if an ifindex is known. This is not the case for ACK timestamps. The difference between the protocol families is probably a historical accident as a result of the different conditions for generating cmsg in the relevant ip(v6)_recv_error function: ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) { ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) { At one time, this was the same test bar for the ICMP/ICMP6 distinction. This is no longer true. Signed-off-by: NWillem de Bruijn <willemb@google.com> ---- Changes v1 -> v2 large rewrite - integrate with existing pktinfo cmsg generation code - on ipv4: only send with new flag, to maintain legacy behavior - on ipv6: send at most a single pktinfo cmsg - on ipv6: initialize fields if not yet initialized The recv cmsg interfaces are also relevant to the discussion of whether looping packet headers is problematic. For v6, cmsgs that identify many headers are already returned. This patch expands that to v4. If it sounds reasonable, I will follow with patches 1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY (http://patchwork.ozlabs.org/patch/366967/) 2. sysctl to conditionally drop all timestamps that have payload or cmsg from users without CAP_NET_RAW. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 12月, 2014 1 次提交
-
-
由 Jiri Pirko 提交于
The goal of this is to provide a possibility to support various switch chips. Drivers should implement relevant ndos to do so. Now there is only one ndo defined: - for getting physical switch id is in place. Note that user can use random port netdevice to access the switch. Signed-off-by: NJiri Pirko <jiri@resnulli.us> Reviewed-by: NThomas Graf <tgraf@suug.ch> Acked-by: NAndy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 11月, 2014 1 次提交
-
-
由 Andrew Lutomirski 提交于
SOF_TIMESTAMPING_OPT_ID puts the id in ee_data, not ee_info. Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 11月, 2014 1 次提交
-
-
由 Mahesh Bandewar 提交于
This driver is very similar to the macvlan driver except that it uses L3 on the frame to determine the logical interface while functioning as packet dispatcher. It inherits L2 of the master device hence the packets on wire will have the same L2 for all the packets originating from all virtual devices off of the same master device. This driver was developed keeping the namespace use-case in mind. Hence most of the examples given here take that as the base setup where main-device belongs to the default-ns and virtual devices are assigned to the additional namespaces. The device operates in two different modes and the difference in these two modes in primarily in the TX side. (a) L2 mode : In this mode, the device behaves as a L2 device. TX processing upto L2 happens on the stack of the virtual device associated with (namespace). Packets are switched after that into the main device (default-ns) and queued for xmit. RX processing is simple and all multicast, broadcast (if applicable), and unicast belonging to the address(es) are delivered to the virtual devices. (b) L3 mode : In this mode, the device behaves like a L3 device. TX processing upto L3 happens on the stack of the virtual device associated with (namespace). Packets are switched to the main-device (default-ns) for the L2 processing. Hence the routing table of the default-ns will be used in this mode. RX processins is somewhat similar to the L2 mode except that in this mode only Unicast packets are delivered to the virtual device while main-dev will handle all other packets. The devices can be added using the "ip" command from the iproute2 package - ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ] Signed-off-by: NMahesh Bandewar <maheshb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Laurent Chavey <chavey@google.com> Cc: Tim Hockin <thockin@google.com> Cc: Brandon Philips <brandon.philips@coreos.com> Cc: Pavel Emelianov <xemul@parallels.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2014 1 次提交
-
-
由 Giuseppe CAVALLARO 提交于
Recently many changes have been done inside the driver so this patch updates the driver's doc for example reviewing information for the rx and tx processes that are managed by napi method, adding new information for missing glue-logic files etc. Signed-off-by: NGiuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 11月, 2014 1 次提交
-
-
由 Loganaden Velvindron 提交于
It was initially sent by Lorenzo Colitti, but was subsequently lost in the final diff he submitted. Signed-off-by: NLoganaden Velvindron <logan@elandsys.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 10月, 2014 2 次提交
-
-
由 Erik Kline 提交于
Add a sysctl that causes an interface's optimistic addresses to be considered equivalent to other non-deprecated addresses for source address selection purposes. Preferred addresses will still take precedence over optimistic addresses, subject to other ranking in the source address selection algorithm. This is useful where different interfaces are connected to different networks from different ISPs (e.g., a cell network and a home wifi network). The current behaviour complies with RFC 3484/6724, and it makes sense if the host has only one interface, or has multiple interfaces on the same network (same or cooperating administrative domain(s), but not in the multiple distinct networks case. For example, if a mobile device has an IPv6 address on an LTE network and then connects to IPv6-enabled wifi, while the wifi IPv6 address is undergoing DAD, IPv6 connections will try use the wifi default route with the LTE IPv6 address, and will get stuck until they time out. Also, because optimistic nodes can receive frames, issue an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMSTIC flag appropriately set). A second RTM_NEWADDR is sent if DAD completes (the address flags have changed), otherwise an RTM_DELADDR is sent. Also: add an entry in ip-sysctl.txt for optimistic_dad. Signed-off-by: NErik Kline <ek@google.com> Acked-by: NLorenzo Colitti <lorenzo@google.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
While testing upcoming Yaogong patch (converting out of order queue into an RB tree), I hit the max reordering level of linux TCP stack. Reordering level was limited to 127 for no good reason, and some network setups [1] can easily reach this limit and get limited throughput. Allow a new max limit of 300, and add a sysctl to allow admins to even allow bigger (or lower) values if needed. [1] Aggregation of links, per packet load balancing, fabrics not doing deep packet inspections, alternative TCP congestion modules... Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 10月, 2014 1 次提交
-
-
由 Li RongQing 提交于
__sk_run_filter has been renamed as __bpf_prog_run, so replace them in comments Signed-off-by: NLi RongQing <roy.qing.li@gmail.com> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 10月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
This patch demonstrates the effect of delaying update of HW tailptr. (based on earlier patch by Jesper) burst=1 is the default. It sends one packet with xmit_more=false burst=2 sends one packet with xmit_more=true and 2nd copy of the same packet with xmit_more=false burst=3 sends two copies of the same packet with xmit_more=true and 3rd copy with xmit_more=false Performance with ixgbe (usec 30): burst=1 tx:9.2 Mpps burst=2 tx:13.5 Mpps burst=3 tx:14.5 Mpps full 10G line rate Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 9月, 2014 1 次提交
-
-
由 Daniel Borkmann 提交于
This work adds the DataCenter TCP (DCTCP) congestion control algorithm [1], which has been first published at SIGCOMM 2010 [2], resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more recently as an informational IETF draft available at [4]). DCTCP is an enhancement to the TCP congestion control algorithm for data center networks. Typical data center workloads are i.e. i) partition/aggregate (queries; bursty, delay sensitive), ii) short messages e.g. 50KB-1MB (for coordination and control state; delay sensitive), and iii) large flows e.g. 1MB-100MB (data update; throughput sensitive). DCTCP has therefore been designed for such environments to provide/achieve the following three requirements: * High burst tolerance (incast due to partition/aggregate) * Low latency (short flows, queries) * High throughput (continuous data updates, large file transfers) with commodity, shallow buffered switches The basic idea of its design consists of two fundamentals: i) on the switch side, packets are being marked when its internal queue length > threshold K (K is chosen so that a large enough headroom for marked traffic is still available in the switch queue); ii) the sender/host side maintains a moving average of the fraction of marked packets, so each RTT, F is being updated as follows: F := X / Y, where X is # of marked ACKs, Y is total # of ACKs alpha := (1 - g) * alpha + g * F, where g is a smoothing constant The resulting alpha (iow: probability that switch queue is congested) is then being used in order to adaptively decrease the congestion window W: W := (1 - (alpha / 2)) * W The means for receiving marked packets resp. marking them on switch side in DCTCP is the use of ECN. RFC3168 describes a mechanism for using Explicit Congestion Notification from the switch for early detection of congestion, rather than waiting for segment loss to occur. However, this method only detects the presence of congestion, not the *extent*. In the presence of mild congestion, it reduces the TCP congestion window too aggressively and unnecessarily affects the throughput of long flows [4]. DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN) processing to estimate the fraction of bytes that encounter congestion, rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate [4], thus it can derive multibit feedback from the information present in the single-bit sequence of marks in its control law. And thus act in *proportion* to the extent of congestion, not its *presence*. Switches therefore set the Congestion Experienced (CE) codepoint in packets when internal queue lengths exceed threshold K. Resulting, DCTCP delivers the same or better throughput than normal TCP, while using 90% less buffer space. It was found in [2] that DCTCP enables the applications to handle 10x the current background traffic, without impacting foreground traffic. Moreover, a 10x increase in foreground traffic did not cause any timeouts, and thus largely eliminates TCP incast collapse problems. The algorithm itself has already seen deployments in large production data centers since then. We did a long-term stress-test and analysis in a data center, short summary of our TCP incast tests with iperf compared to cubic: This test measured DCTCP throughput and latency and compared it with CUBIC throughput and latency for an incast scenario. In this test, 19 senders sent at maximum rate to a single receiver. The receiver simply ran iperf -s. The senders ran iperf -c <receiver> -t 30. All senders started simultaneously (using local clocks synchronized by ntp). This test was repeated multiple times. Below shows the results from a single test. Other tests are similar. (DCTCP results were extremely consistent, CUBIC results show some variance induced by the TCP timeouts that CUBIC encountered.) For this test, we report statistics on the number of TCP timeouts, flow throughput, and traffic latency. 1) Timeouts (total over all flows, and per flow summaries): CUBIC DCTCP Total 3227 25 Mean 169.842 1.316 Median 183 1 Max 207 5 Min 123 0 Stddev 28.991 1.600 Timeout data is taken by measuring the net change in netstat -s "other TCP timeouts" reported. As a result, the timeout measurements above are not restricted to the test traffic, and we believe that it is likely that all of the "DCTCP timeouts" are actually timeouts for non-test traffic. We report them nevertheless. CUBIC will also include some non-test timeouts, but they are drawfed by bona fide test traffic timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing TCP timeouts. DCTCP reduces timeouts by at least two orders of magnitude and may well have eliminated them in this scenario. 2) Throughput (per flow in Mbps): CUBIC DCTCP Mean 521.684 521.895 Median 464 523 Max 776 527 Min 403 519 Stddev 105.891 2.601 Fairness 0.962 0.999 Throughput data was simply the average throughput for each flow reported by iperf. By avoiding TCP timeouts, DCTCP is able to achieve much better per-flow results. In CUBIC, many flows experience TCP timeouts which makes flow throughput unpredictable and unfair. DCTCP, on the other hand, provides very clean predictable throughput without incurring TCP timeouts. Thus, the standard deviation of CUBIC throughput is dramatically higher than the standard deviation of DCTCP throughput. Mean throughput is nearly identical because even though cubic flows suffer TCP timeouts, other flows will step in and fill the unused bandwidth. Note that this test is something of a best case scenario for incast under CUBIC: it allows other flows to fill in for flows experiencing a timeout. Under situations where the receiver is issuing requests and then waiting for all flows to complete, flows cannot fill in for timed out flows and throughput will drop dramatically. 3) Latency (in ms): CUBIC DCTCP Mean 4.0088 0.04219 Median 4.055 0.0395 Max 4.2 0.085 Min 3.32 0.028 Stddev 0.1666 0.01064 Latency for each protocol was computed by running "ping -i 0.2 <receiver>" from a single sender to the receiver during the incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure that traffic traversed the DCTCP queue and was not dropped when the queue size was greater than the marking threshold. The summary statistics above are over all ping metrics measured between the single sender, receiver pair. The latency results for this test show a dramatic difference between CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer which incurs the maximum queue latency (more buffer memory will lead to high latency.) DCTCP, on the other hand, deliberately attempts to keep queue occupancy low. The result is a two orders of magnitude reduction of latency with DCTCP - even with a switch with relatively little RAM. Switches with larger amounts of RAM will incur increasing amounts of latency for CUBIC, but not for DCTCP. 4) Convergence and stability test: This test measured the time that DCTCP took to fairly redistribute bandwidth when a new flow commences. It also measured DCTCP's ability to remain stable at a fair bandwidth distribution. DCTCP is compared with CUBIC for this test. At the commencement of this test, a single flow is sending at maximum rate (near 10 Gbps) to a single receiver. One second after that first flow commences, a new flow from a distinct server begins sending to the same receiver as the first flow. After the second flow has sent data for 10 seconds, the second flow is terminated. The first flow sends for an additional second. Ideally, the bandwidth would be evenly shared as soon as the second flow starts, and recover as soon as it stops. The results of this test are shown below. Note that the flow bandwidth for the two flows was measured near the same time, but not simultaneously. DCTCP performs nearly perfectly within the measurement limitations of this test: bandwidth is quickly distributed fairly between the two flows, remains stable throughout the duration of the test, and recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth fairly, and has trouble remaining stable. CUBIC DCTCP Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2 0 9.93 0 0 9.92 0 0.5 9.87 0 0.5 9.86 0 1 8.73 2.25 1 6.46 4.88 1.5 7.29 2.8 1.5 4.9 4.99 2 6.96 3.1 2 4.92 4.94 2.5 6.67 3.34 2.5 4.93 5 3 6.39 3.57 3 4.92 4.99 3.5 6.24 3.75 3.5 4.94 4.74 4 6 3.94 4 5.34 4.71 4.5 5.88 4.09 4.5 4.99 4.97 5 5.27 4.98 5 4.83 5.01 5.5 4.93 5.04 5.5 4.89 4.99 6 4.9 4.99 6 4.92 5.04 6.5 4.93 5.1 6.5 4.91 4.97 7 4.28 5.8 7 4.97 4.97 7.5 4.62 4.91 7.5 4.99 4.82 8 5.05 4.45 8 5.16 4.76 8.5 5.93 4.09 8.5 4.94 4.98 9 5.73 4.2 9 4.92 5.02 9.5 5.62 4.32 9.5 4.87 5.03 10 6.12 3.2 10 4.91 5.01 10.5 6.91 3.11 10.5 4.87 5.04 11 8.48 0 11 8.49 4.94 11.5 9.87 0 11.5 9.9 0 SYN/ACK ECT test: This test demonstrates the importance of ECT on SYN and SYN-ACK packets by measuring the connection probability in the presence of competing flows for a DCTCP connection attempt *without* ECT in the SYN packet. The test was repeated five times for each number of competing flows. Competing Flows 1 | 2 | 4 | 8 | 16 ------------------------------ Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0 Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0 As the number of competing flows moves beyond 1, the connection probability drops rapidly. Enabling DCTCP with this patch requires the following steps: DCTCP must be running both on the sender and receiver side in your data center, i.e.: sysctl -w net.ipv4.tcp_congestion_control=dctcp Also, ECN functionality must be enabled on all switches in your data center for DCTCP to work. The default ECN marking threshold (K) heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]). In above tests, for each switch port, traffic was segregated into two queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of 0x04 - the packet was placed into the DCTCP queue. All other packets were placed into the default drop-tail queue. For the DCTCP queue, RED/ECN marking was enabled, here, with a marking threshold of 75 KB. More details however, we refer you to the paper [2] under section 3). There are no code changes required to applications running in user space. DCTCP has been implemented in full *isolation* of the rest of the TCP code as its own congestion control module, so that it can run without a need to expose code to the core of the TCP stack, and thus nothing changes for non-DCTCP users. Changes in the CA framework code are minimal, and DCTCP algorithm operates on mechanisms that are already available in most Silicon. The gain (dctcp_shift_g) is currently a fixed constant (1/16) from the paper, but we leave the option that it can be chosen carefully to a different value by the user. In case DCTCP is being used and ECN support on peer site is off, DCTCP falls back after 3WHS to operate in normal TCP Reno mode. ss {-4,-6} -t -i diag interface: ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054 ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584 send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15 reordering:101 rcv_space:29200 ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448 cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate 325.5Mbps rcv_rtt:1.5 rcv_space:29200 More information about DCTCP can be found in [1-4]. [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Joint work with Florian Westphal and Glenn Judd. Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com> Acked-by: NStephen Hemminger <stephen@networkplumber.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 9月, 2014 1 次提交
-
-
由 Dan Williams 提交于
Per commit "77873803 net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: NRoman Gushchin <klamm@yandex-team.ru> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 27 9月, 2014 2 次提交
-
-
由 Alexei Starovoitov 提交于
this patch adds all of eBPF verfier documentation and empty bpf_check() The end goal for the verifier is to statically check safety of the program. Verifier will catch: - loops - out of range jumps - unreachable instructions - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention More details in Documentation/networking/filter.txt Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
BPF syscall is a multiplexor for a range of different operations on eBPF. This patch introduces syscall with single command to create a map. Next patch adds commands to access maps. 'maps' is a generic storage of different types for sharing data between kernel and userspace. Userspace example: /* this syscall wrapper creates a map with given type and attributes * and returns map_fd on success. * use close(map_fd) to delete the map */ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries) { union bpf_attr attr = { .map_type = map_type, .key_size = key_size, .value_size = value_size, .max_entries = max_entries }; return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } 'union bpf_attr' is backwards compatible with future extensions. More details in Documentation/networking/filter.txt and in manpage Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 9月, 2014 2 次提交
-
-
由 Peter Foley 提交于
Remove empty networking/.gitignore Signed-off-by: NPeter Foley <pefoley2@pefoley.com> Cc: rdunlap@infradead.org Cc: linux-doc@vger.kernel.org Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
由 Peter Foley 提交于
Add some missing files to .gitignore. Push Documentation/.gitignore down into subdirectories. Signed-off-by: NPeter Foley <pefoley2@pefoley.com> Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-