- 02 5月, 2019 30 次提交
-
-
由 Md Fahad Iqbal Polash 提交于
Runtime change of PFINT_OICR_ENA register is unnecessary. The handlers should always clear the atomic bit for each task as they start, because it will make sure that any late interrupt will either 1) re-set the bit, or 2) be handled directly in the "already running" task handler. Signed-off-by: NMd Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Akeem G Abodunrin 提交于
This patch fixes issue with non trusted VFs being able to add more than permitted number of VLANs by adding a check in ice_vc_process_vlan_msg. Also don't return an error in this case as the VF does not need to know that it is not trusted. Also rework ice_vsi_kill_vlan to use the right types. Signed-off-by: NAkeem G Abodunrin <akeem.g.abodunrin@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Brett Creeley 提交于
In ice_vsi_ctrl_rx_rings() we are unnecessarily waiting for QRX_CTRL_QENA_REQ and QRX_CTRL_QENA_STAT to be the same value prior to disabling each Rx queue. There is no reason to do this so remove this wait loop as we already have a wait loop after disabling/enabling the Rx queue through the QRX_CTRL register to make sure it gets successfully disabled/enabled. Signed-off-by: NBrett Creeley <brett.creeley@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Brett Creeley 提交于
Currently the driver allows rx-usecs-high values to be set, but when querying the device for rx-usecs-high the value does not stick. This is because it was not yet implemented. Add code to allow the user to change rx-usecs-high and use this to set the q_vector's intrl value. Signed-off-by: NBrett Creeley <brett.creeley@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Paul Greenwalt 提交于
Add support to set 52 byte RSS hash key. Signed-off-by: NPaul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Brett Creeley 提交于
There are many places in the code where we do the following: for (i = 0; i < vsi->num_q_vectors; i++) Instead use the macro mentioned in the commit title: ice_for_each_q_vector(vsi, i) Signed-off-by: NBrett Creeley <brett.creeley@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Maciej Fijalkowski 提交于
When stopping Tx rings, we use 'i' as an ring array index for looking up whether the ice_ring exists and have assigned a q_vector. This checks rings only within a given TC and we need to go through every ring in VSI. Use 'q_idx' instead. Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Brett Creeley 提交于
Reduce scope of the variable 'err' to inside the for loop instead of using it as a second looping conditional. Also while here, improve the debug message if we fail to configure a Rx queue. Signed-off-by: NBrett Creeley <brett.creeley@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Bruce Allan 提交于
Static analysis points out the default case in the switch statement in ice_get_itr_intrl_gran() is an infeasible condition causing the default case statement to be unreachable. Remove it and since the function no longer returns anything but success, change it to just return void and update the only call to it accordingly. Signed-off-by: NBruce Allan <bruce.w.allan@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Akeem G Abodunrin 提交于
If there is no queue to disable, return appropriate configuration error earlier without acquiring the lock. Signed-off-by: NAkeem G Abodunrin <akeem.g.abodunrin@intel.com> Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Anirudh Venkataramanan 提交于
This patch introduces a framework to store queue specific information in VSI queue contexts. Currently VSI queue context (represented by struct ice_q_ctx) only has q_handle as a member. In future patches, this structure will be updated to hold queue specific information. Signed-off-by: NAnirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 David S. Miller 提交于
Maxime Chevallier says: ==================== net: mvpp2: cls: Add classification This series is a rework of the previously standalone patch adding classification support for mvpp2 : https://lore.kernel.org/netdev/20190423075031.26074-1-maxime.chevallier@bootlin.com/ This patch has been reworked according to Saeed's review, to make sure that the location of the rule is always respected and serves as a way to prioritize rules between each other. This the 3rd iteration of this submission, but since it's now a series, I reset the revision numbering. This series implements that in a limited configuration for now, since we limit the total number of rules per port to 4. The main factors for this limitation are that : - We share the classification tables between all ports (4 max, although one is only used for internal loopback), hence we have to perform a logical separation between rules, which is done today by dedicated ranges for each port in each table - The "Flow table", which dictates which lookups operations are performed for an ingress packet, in subdivided into 22 "sub flows", each corresponding to a traffic type based on the L3 proto, L4 proto, the presence or not of a VLAN tag and the L3 fragmentation. This makes so that when adding a rule, it has to be added into each of these subflows, introducing duplications of entries and limiting our max number of entries. These limitations can be overcomed in several ways, but for readability sake, I'd rather submit basic classification offload support for now, and improve it gradually. This series also adds a small cosmetic cleanup patch (1), and also adds support for the "Drop" action compared to the first submission of this feature. It is simple enough to be added with this basic support. Compared to the first submissions, the NETIF_F_NTUPLE flag was also removed, following Saeed's comment. ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maxime Chevallier 提交于
This commit introduces support for the "Drop" action in classification offload. This corresponds to the "-1" action with ethtool -N. This is achieved using the color marking actions available in the C2 engine, which associate a color to a packet. These colors can be either Green, Yellow or Red, Red meaning that the packet should be dropped. Green and Yellow colors are interpreted by the Policer, which isn't supported yet. This method of dropping using the Classifier is different than the already existing early-drop features, such as VLAN filtering and MAC UC/MC filtering, which are performed during the Parsing step, and therefore take precedence over classification actions. Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maxime Chevallier 提交于
This commit introduces basic classification offloading support for the PPv2 controller. The PPv2 classifier has many classification engines, for now we only use the C2 TCAM match engine. This engine allows to perform ternary lookups on 64 bits keys (called Header Extracted Key), that are built by extracting fields from the packet header and concatenating them. At most 4 fields can be extracted for a single lookup. This basic implementation allows to build the HEK from the following fields : - L4 source and destination ports (for UDP and TCP) More fields are to be added in the future. Classification flows are added through the ethtool interface, using the newly introduced flow_rule infrastructure as an internal rule representation, allowing to more easily implement tc flower rules if need be. The internal design for now allocates one range of 4 rules per port due to the internal design of the flow table, which uses 22 sub-flows. When inserting a classification rule, the rule is created in every relevant sub-flow. This low rule-count is a very simple design which reaches quickly the limitations of the flow table ordering, but guarantees that the rule ordering will always be respected. This commit only introduces support for the "steer to rxq" action. Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maxime Chevallier 提交于
As of today, the classification code is used only for RSS. We split the incoming traffic into multiple flows, that correspond to the ethtool flow_type parameter. We don't want to use the ethtool flow definitions such as TCP_V4_FLOW, for several reason : - We want to decorrelate the driver code from ethtool as much as possible, so that we can easily use other interfaces such as tc flower, - We want the flow_type to be a bitfield, so that we can match flows embedded into each other, such as TCP4 which is a subset of IP4. This commit does the conversion to the newer type. Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maxime Chevallier 提交于
Cosmetic patch removing extra whitespaces when writing the flow_table entries Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Esben Haabendal says: ==================== net: ll_temac: x86_64 support This patch series adds support for use of ll_temac driver with platform_data configuration and fixes endianess and 64-bit problems so that it can be used on x86_64 platform. A few bugfixes are also included. Changes since v2: - Fixed lp->indirect_mutex initialization regression for OF platforms introduced in v2 Changes since v1: - Make indirect_mutex specification mandatory when using platform_data - Move header to include/linux/platform_data - Enable COMPILE_TEST for XILINX_LL_TEMAC - Rebased to v5.1-rc7 ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
As soon as TAILDESCR_PTR is written, DMA transfers might start. Let's ensure we are ready to receive DMA IRQ's before doing that. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
This allows custom setup of IRQ coalescing for platforms using legacy platform_device. The irq timeout and count parameters can be used for tuning cpu load vs. latency. I have maintained the 0x00000400 bit in TX_CHNL_CTRL. It is specified as unused in the documentation I have available. It does not make any difference in the hardware I have available, so it is left in to not risk breaking other platforms where it might be used. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Use usleep_range() to avoid problems with msleep() actually sleeping much longer than expected. Signed-off-by: NEsben Haabendal <esben@geanix.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
As we are actually using a BD for both the skb and each frag contained in it, the oldest TX BD would be overwritten when there was exactly one BD less than needed. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Unmap the actual buffer length, not the amount of data received, avoiding resource exhaustion of swiotlb (seen on x86_64 platform). Signed-off-by: NEsben Haabendal <esben@geanix.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Indirect register access goes through a DCR bus bridge, which allows only one outstanding transaction. And to make matters worse, each TEMAC IP block contains two Ethernet interfaces, and although they seem to have separate registers for indirect access, they actually share the registers. Or to be more specific, MSW, LSW and CTL registers are physically shared between Ethernet interfaces in same TEMAC IP, with RDY register being (almost) specificic to the Ethernet interface. The 0x10000 bit in RDY reflects combined bus ready state though. So we need to take care to synchronize not only within a single device, but also between devices in same TEMAC IP. This commit allows to do that with legacy platform devices. For OF devices, the xlnx,compound parent of the temac node should be used to find siblings, and setup a shared indirect_mutex between them. I will leave this work to somebody else, as I don't have hardware to test that. No regression is introduced by that, as before this commit using two Ethernet interfaces in same TEMAC block is simply broken. Signed-off-by: NEsben Haabendal <esben@geanix.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
With little-endian and 64-bit support in place, the ll_temac driver can now be used on x86 and x86_64 platforms. And while at it, enable COMPILE_TEST also. Signed-off-by: NEsben Haabendal <esben@geanix.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Both TEMAC and SDMA is big-endian, so make sure that all values in SDMA buffer descriptors (cmdac_bd) are handled as big-endian, independent of the host endianness. With all currently supported platforms being big-endian, this change does not make a change for any of them. Note, when using app3 and app4 for piggybacking skb pointers there is no need to care about endianness, as neither TEMAC nor SDMA access app3 and app4 in TX buffer descriptors. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Replace the powerpc specific MMIO register access functions with the generic big-endian mmio access functions, and add support for little-endian access depending on configuration. Big-endian access is maintained as the default, but little-endian can be configured in device-tree binding or in platform data. The temac_ior()/temac_iow() functions are replaced with macro wrappers to avoid modifying existing code more than necessary. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
The use of buffer descriptor APP4 field (32-bit) for storing skb pointer obviously does not work on 64-bit platforms. As APP3 is also unused, we can use that to store the other half of 64-bit pointer values. Contrary to what is hinted at in commit message of commit 15bfe05c ("net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit") there are no other pointers stored in cdmac_bd. Signed-off-by: NEsben Haabendal <esben@geanix.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
Support initialization with platdata, so the driver can be used on non-device-tree platforms. For currently supported device-tree platforms, the driver should behave as before. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Esben Haabendal 提交于
As a side effect, a few error cases are fixed. If of_iomap() of sdma_regs failed, no error code was returned. Fixed to return -ENOMEM similar to of_iomap() fail of regs. If sysfs_create_group() or register_netdev() failed, lp->phy_node was not released. Finally, the order in remove function is corrected to be reverse order of what is done in probe, i.e. calling temac_mdio_teardown() last, so we unregister the netdev that most likely is using the mdio_bus first. Signed-off-by: NEsben Haabendal <esben@geanix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 YueHaibing 提交于
Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe, The proper pointer to use is clk instead of mode. This issue was detected with the help of Coccinelle. Fixes: 83a8471b ("net: ethernet: ti: cpsw: refactor probe to group common hw initialization") Signed-off-by: NYueHaibing <yuehaibing@huawei.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 5月, 2019 10 次提交
-
-
由 David S. Miller 提交于
Vinicius Costa Gomes says: ==================== net/sched: taprio change schedules Changes from RFC: - Removed the patches for taprio offloading, because of the lack of in-tree users; - Updated the links to point to the PATCH version of this series; Original cover letter: Overview -------- This RFC has two objectives, it adds support for changing the running schedules during "runtime", explained in more detail later, and proposes an interface between taprio and the drivers for hardware offloading. These two different features are presented together so it's clear what the "final state" would look like. But after the RFC stage, they can be proposed (and reviewed) separately. Changing the schedules without disrupting traffic is important for handling dynamic use cases, for example, when streams are added/removed and when the network configuration changes. Hardware offloading support allows schedules to be more precise and have lower resource usage. Changing schedules ------------------ The same as the other interfaces we proposed, we try to use the same concepts as the IEEE 802.1Q-2018 specification. So, for changing schedules, there are an "oper" (operational) and an "admin" schedule. The "admin" schedule is mutable and not in use, the "oper" schedule is immutable and is in use. That is, when the user first adds an schedule it is in the "admin" state, and it becomes "oper" when its base-time (basically when it starts) is reached. What this means is that now it's possible to create taprio with a schedule: $ tc qdisc add dev IFACE parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ base-time 10000000 \ sched-entry S 03 300000 \ sched-entry S 02 300000 \ sched-entry S 06 400000 \ clockid CLOCK_TAI And then, later, after the previous schedule is "promoted" to "oper", add a new ("admin") schedule to be used some time later: $ tc qdisc change dev IFACE parent root handle 100 taprio \ base-time 1553121866000000000 \ sched-entry S 02 500000 \ sched-entry S 0f 400000 \ clockid CLOCK_TAI When enabling the ability to change schedules, it makes sense to add two more defined knobs to schedules: "cycle-time" allows to truncate a cycle to some value, so it repeats after a well-defined value; "cycle-time-extension" controls how much an entry can be extended if it's the last one before the change of schedules, the reason is to avoid a very small cycle when transitioning from a schedule to another. With these, taprio in the software mode should provide a fairly complete implementation of what's defined in the Enhancements for Scheduled Traffic parts of the specification. Hardware offload ---------------- Some workloads require better guarantees from their schedules than what's provided by the software implementation. This series proposes an interface for configuring schedules into compatible network controllers. This part is proposed together with the support for changing schedules, because it raises questions like, should the "qdisc" side be responsible of providing visibility into the schedules or should it be the driver? In this proposal, the driver is called passing the new schedule as soon as it is validated, and the "core" qdisc takes care of displaying (".dump()") the correct schedules at all times. It means that some logic would need to be duplicated in the driver, if the hardware doesn't have support for multiple schedules. But as taprio doesn't have enough information about the underlying controller to know how much in advance a schedule needs to be informed to the hardware, it feels like a fair compromise. The hardware offloading part of this proposal also tries to define an interface for frame-preemption and how it interacts with the scheduling of traffic, see Section 8.6.8.4 of IEEE 802.1Q-2018 for more information. One important difference between the qdisc interface and the qdisc-driver interface, is that the "gate mask" on the qdisc side references traffic classes, that is bit 0 of the gate mask means Traffic Class 0, and in the driver interface, it specifies the queues, that is bit 0 means queue 0. That is to say that taprio converts the references to traffic classes to references to queues before sending the offloading request to the driver. Request for help ---------------- I would like that interested driver maintainers could take a look at the proposed interface and see if it's going to be too awkward for any particular device. Also, pointers to available documentation would be appreciated. The idea here is to start a discussion so we can have an interface that would work for multiple vendors. Links ----- kernel patches: https://github.com/vcgomes/net-next/tree/taprio-add-support-for-change-v3 iproute2 patches: https://github.com/vcgomes/iproute2/tree/taprio-add-support-for-change-v3 ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vinicius Costa Gomes 提交于
IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the last entry of a schedule before the start of a new schedule can be extended, so "too-short" entries can be avoided. Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vinicius Costa Gomes 提交于
IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be overridden, so the schedule is truncated to a determined "width". Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vinicius Costa Gomes 提交于
The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from operational?) and "Admin" ones. Up until now, 'taprio' only had support for the "Oper" one, added when the qdisc is created. This adds support for the "Admin" one, which allows the .change() operation to be supported. Just for clarification, some quick (and dirty) definitions, the "Oper" schedule is the currently (as in this instant) running one, and it's read-only. The "Admin" one is the one that the system configurator has installed, it can be changed, and it will be "promoted" to "Oper" when it's 'base-time' is reached. The idea behing this patch is that calling something like the below, (after taprio is already configured with an initial schedule): $ tc qdisc change taprio dev IFACE parent root \ base-time X \ sched-entry <CMD> <GATES> <INTERVAL> \ ... Will cause a new admin schedule to be created and programmed to be "promoted" to "Oper" at instant X. If an "Admin" schedule already exists, it will be overwritten with the new parameters. Up until now, there was some code that was added to ease the support of changing a single entry of a schedule, but was ultimately unused. Now, that we have support for "change" with more well thought semantics, updating a single entry seems to be less useful. So we remove what is in practice dead code, and return a "not supported" error if the user tries to use it. If changing a single entry would make the user's life easier we may ressurrect this idea, but at this point, removing it simplifies the code. For now, only the schedule specific bits are allowed to be added for a new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues' cannot be modified. Example: $ tc qdisc change dev IFACE parent root handle 100 taprio \ base-time $BASE_TIME \ sched-entry S 00 500000 \ sched-entry S 0f 500000 \ clockid CLOCK_TAI The only change in the netlink API introduced by this change is the introduction of an "admin" type in the response to a dump request, that type allows userspace to separate the "oper" schedule from the "admin" schedule. If userspace doesn't support the "admin" type, it will only display the "oper" schedule. Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vinicius Costa Gomes 提交于
Right now, this isn't a problem, but the next commit allows schedules to be added during runtime. When a new schedule transitions from the inactive to the active state ("admin" -> "oper") the previous one can be freed, if it's freed just after the RCU read lock is released, we may access an invalid entry. So, we should take care to protect the dequeue() flow, so all the places that access the entries are protected by the RCU read lock. Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Yuchung Cheng says: ==================== undo congestion window on spurious SYN or SYNACK timeout Linux TCP currently uses the initial congestion window of 1 packet if multiple SYN or SYNACK timeouts per RFC6298. However such timeouts are often spurious on wireless or cellular networks that experience high delay variances (e.g. ramping up dormant radios or local link retransmission). Another case is when the underlying path is longer than the default SYN timeout (e.g. 1 second). In these cases starting the transfer with a minimal congestion window is detrimental to the performance for short flows. One naive approach is to simply ignore SYN or SYNACK timeouts and always use a larger or default initial window. This approach however risks pouring gas to the fire when the network is already highly congested. This is particularly true in data center where application could start thousands to millions of connections over a single or multiple hosts resulting in high SYN drops (e.g. incast). This patch-set detects spurious SYN and SYNACK timeouts upon completing the handshake via the widely-supported TCP timestamp options. Upon such events the sender reverts to the default initial window to start the data transfer so it gets best of both worlds. This patch-set supports this feature for both active and passive as well as Fast Open or regular connections. ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Relocate the congestion window initialization from tcp_init_metrics() to tcp_init_transfer() to improve code readability. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Use a helper to consolidate two identical code block for passive TFO. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This patch makes passive Fast Open reverts the cwnd to default initial cwnd (10 packets) if the SYNACK timeout is spurious. Passive Fast Open uses a full socket during handshake so it can use the existing undo logic to detect spurious retransmission by recording the first SYNACK timeout in key state variable retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket has sent some data before the timeout, the spurious timeout is detected by tcp_try_undo_recovery() in tcp_process_loss() in tcp_ack(). But if the socket has not send any data yet, tcp_ack() does not execute the undo code since no data is acknowledged. The fix is to check such case explicitly after tcp_ack() during the ACK processing in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state in case the server closes the socket before handshake completes. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
TCP sender would use congestion window of 1 packet on the second SYN and SYNACK timeout except passive TCP Fast Open. This makes passive TFO too aggressive and unfair during congestion at handshake. This patch fixes this issue so TCP (fast open or not, passive or active) always conforms to the RFC6298. Note that tcp_enter_loss() is called only once during recurring timeouts. This is because during handshake, high_seq and snd_una are the same so tcp_enter_loss() would incorrect set the undo state variables multiple times. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-