- 03 10月, 2017 13 次提交
-
-
由 Alan Brady 提交于
Currently we inappropriately clear the vf_states variable with a null assignment. This is problematic because we should be using atomic bitops on this variable and we don't actually want to clear all the flags. We should just clear the ones we know we want to clear. Additionally remove the I40E_VF_STATE_FCOEENA bit because it is no longer being used. Signed-off-by: NAlan Brady <alan.brady@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Mitch Williams 提交于
This function cannot fail, so why is it returning a value? And why are we checking it? Why shouldn't we just make it void? Why is this commit message made up of only questions? Signed-off-by: NMitch Williams <mitch.a.williams@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Alan Brady 提交于
Currently the VF gets a default number of allocated queues from HW on init and it could choose to enable or disable those allocated queues. This makes it such that the VF can request more or less underlying allocated queues from the PF. First the VF negotiates the number of queues it wants that can be supported by the PF and if successful asks for a reset. During reset the PF will reallocate the HW queues for the VF and will then remap the new queues. Signed-off-by: NAlan Brady <alan.brady@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
It is possible although rare that we may not reset when i40e_vc_disable_vf() is called. This can lead to some weird circumstances with some values not being properly set. Modify i40e_reset_vf() to return a code indicating whether it reset or not. Now, i40e_vc_disable_vf() can wait until a reset actually occurs. If it fails to free up within a reasonable time frame we'll display a warning message. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
Replace i40e_vc_notify_vf_reset and i40e_reset_vf with a call to i40e_vc_disable_vf which does this exact thing. This matches similar code patterns throughout the driver. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
It's never used, and the vf structure could get back to the PF if necessary. Lets just drop the extra unneeded parameter. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
When we refactored handling of the PVID in commit 9af52f60 ("i40e: use (add|rm)_vlan_all_mac helper functions when changing PVID") we introduced a scheduling while atomic regression. This occurred because we now held the spinlock across a call to i40e_reset_vf(), which results in a usleep_range() call that triggers a scheduling while atomic bug. This was rare as it only occurred if the user configured a VLAN on a VF and also attempted to reconfigure the VF from the host system with a port VLAN. We do need to hold the lock while calling i40e_is_vsi_in_vlan(), but we should not be holding it while we reset the VF. We'll fix this by introducing a separate helper function i40e_vsi_has_vlans which checks whether we have a PVID and whether the VSI has configured VLANs. This helper function will manage its own need for the mac_filter_hash_lock. Then, we can move the acquiring of the spinlock until after we reset the VF, which ensures that we do not sleep while holding the lock. Using a separate function like this makes the code more clear and is easier to read than attempting to release and re-acquire the spinlock when we reset the VF. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Mariusz Stachura 提交于
Instead of accessing register directly, use newly added AQC in order to blink LEDs. Introduce and utilize a new flag to prevent excessive API version checking. Signed-off-by: NMariusz Stachura <mariusz.stachura@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Filip Sadowski 提交于
This patch adds support for 'ethtool -m' command which displays information about (Q)SFP+ module plugged into NIC's cage. Signed-off-by: NFilip Sadowski <filip.sadowski@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Filip Sadowski 提交于
This patch fixes incorrect reporting of supported link modes on some NICs. Signed-off-by: NFilip Sadowski <filip.sadowski@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Christophe JAILLET 提交于
If 'kzalloc()' fails, a NULL pointer will be dereferenced. Return an error code (-ENOMEM) instead. Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Lihong Yang 提交于
This patch removes the !vf condition check that cannot be true in i40e_ndo_set_vf_trust function Detected by CoverityScan, CID 1397531 Logically dead code Signed-off-by: NLihong Yang <lihong.yang@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Shannon Nelson 提交于
When a machine has more CPUs than queue pairs, e.g. 512 cores, the counting gets a little funky and turns off Flow Director with the message: not enough queues for Flow Director. Flow Director feature is disabled This patch limits the number of lan queues initially allocated to be sure we have some left for FD and other features. Signed-off-by: NShannon Nelson <shannon.nelson@oracle.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 30 9月, 2017 15 次提交
-
-
由 Mitch Williams 提交于
The i40e driver now supports two different devices with two different firmware versions. So be smart about how we handle these. Move the FW version macros to the appropriate header file, and add a convenience macro that checks the version based on the device. Then use this macro to check whether or not the driver can use the new link info API. Signed-off-by: NMitch Williams <mitch.a.williams@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Alan Brady 提交于
Currently the PF allocates a default number of queues for each VF and cannot be changed. This patch enables the VF to request a different number of queues allocated to it. This patch also adds a new virtchnl op and capability flag to facilitate this negotiation. After the PF receives a request message, it will set a requested number of queues for that VF. Then when the VF resets, its VSI will get a new number of queues allocated to it. This is a best effort request and since we only allocate a guaranteed default number, if the VF tries to ask for more than the guaranteed number, there may not be enough in HW to accommodate it unless other queues for other VFs are freed. It should also be noted decreasing the number queues allocated to a VF to below the default will NOT enable the allocation of more than 32 VFs per PF and will not free queues guaranteed to each VF by default. Signed-off-by: NAlan Brady <alan.brady@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Alan Brady 提交于
The current implementation for mapping queues to vectors is broken because it attempts to map each Tx and Rx ring to its own vector, however we use combined queues so we should actually be mapping the Tx/Rx rings together on one vector. Also in the current implementation, in the case where we have more queues than vectors, we attempt to group the queues together into 'chunks' and map each 'chunk' of queues to a vector. Chunking them together would be more ideal if, and only if, we only had RSS because of the way the hashing algorithm works but in the case of a future patch that enables VF ADq, round robin assignment is better and still works with RSS. This patch resolves both those issues and simplifies the code needed to accomplish this. Instead of treating the case where we have more queues than vectors as special, if we notice our vector index is greater than vectors, reset the vector index to zero and continue mapping. This should ensure that in both cases, whether we have enough vectors for each queue or not, the queues get appropriately mapped. Signed-off-by: NAlan Brady <alan.brady@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
On some platforms with a large number of CPUs, we will allocate many IRQ vectors. When hibernating, the system will attempt to migrate all of the vectors back to CPU0 when shutting down all the other CPUs. It is possible that we have so many vectors that it cannot re-assign them to CPU0. This is even more likely if we have many devices installed in one platform. The end result is failure to hibernate, as it is not possible to shutdown the CPUs. We can avoid this by disabling MSI-X and clearing our interrupt scheme when the device is suspended. A more ideal solution would be some method for the stack to properly handle this for all drivers, rather than on a case-by-case basis for each driver to fix itself. However, until this more ideal solution exists, we can do our part and shutdown our IRQs during suspend, which should allow systems with a large number of CPUs to safely suspend or hibernate. It may be worth investigating if we should shut down even further when we suspend as it may make the path cleaner, but this was the minimum fix for the hibernation issue mentioned here. Testing-hints: This affects systems with a large number of CPUs, and with multiple devices enabled. Without this change, those platforms are unable to hibernate at all. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
Although the service task does check the suspended status before running, it might already be part way through running when we go to suspend. Lets ensure that the service task is stopped and will not be restarted again until we finish resuming. This ensures that service task code does not cause strange interactions with the suspend/resume handlers. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
When handling suspend and resume callbacks we want to make sure that (a) we don't suspend again if we're already suspended and (b) we don't resume again if we're already resuming. Lets make sure we test_and_set the __I40E_SUSPENDED bit in i40e_suspend which ensures that a suspend call when already suspended will exit early. Additionally, if __I40E_SUSPENDED is not set when we begin resuming, exit early as well. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
Stop using the old legacy PM support, since we now have stable support for the newer generic PM callbacks. This has several advantages. First, we no longer have to manage our own pci_save_state() and power changes, as it's preferred to have the PCI stack do this. Second, these routines get called for both hibernate and suspend to ram, so we can have the driver properly handle all the suspend/resume flows that it needs to. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
We currently (mis)use the __I40E_RECOVERY_PENDING bit to determine when we should actually request a new IRQ in i40e_setup_misc_vector(). This led to a design mistake where we open-coded the re-setup of the miscellaneous vector in i40e_resume() instead of using the function provided. If we did not open-code this and instead tried to use the i40e_setup_misc_vector() function, it would lead to never reallocating the IRQ. This would lead to a second i40e_suspend() call failing to free the vector due to a NULL pointer dereference. A future patch is going to re-work how the i40e_suspend() and i40e_resume() flows work to clear all IRQ vectors, which would require us to use i40e_setup_misc_vector() directly. Since during this time the __I40E_RECOVERY_PENDING bit is set, we'll never re-allocate the vector. Rather than leaving the open-coded setup in i40e_resume() lets just fix the problem properly in i40e_setup_misc_vector(). Introduce a new state bit which indicates when the IRQ has been assigned, which will be set when i40e_setup_misc_vector is first called. This ultimately resolves the issue of re-requesting the vector, without overloading the __I40E_RECOVERY_PENDING state. This ensures that the suspend/resume cycle can use the setup function instead of open-coding the re-request during resume. Additionally, since the only callers of i40e_stop_misc_vector also want to free it, move this code directly into the function to avoid duplication. Due to the new functionality, rename it to i40e_free_misc_vector(). This lets us drop the extra calls to free and re-enable the vector during i40e_suspend() and i40e_resume(). We don't need to call i40e_setup_misc_Vector() in i40e_resume() because it gets called by the i40e_rebuild() call. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Mitch Williams 提交于
We see this message regularly on VF reset or unload (which invokes a reset). It's essentially meaningless unless it's happening constantly. To prevent consternation, lower the log level to debug so it's not seen under normal circumstance. Signed-off-by: NMitch Williams <mitch.a.williams@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Mariusz Stachura 提交于
An errata with GLQF_PCNT causes it to not wrap as expected. This can cause an error in flow director statistics. This patch resets affected counters just after reading. Signed-off-by: NMariusz Stachura <mariusz.stachura@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Mariusz Stachura 提交于
Fortville and Fort Park devices are often on different firmware release schedules. This change relaxes the minor version warning message, so it is only displayed for older FW warning version for old firmware Fortville 3 or earlier. Signed-off-by: NMariusz Stachura <mariusz.stachura@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Sudheer Mogilappagari 提交于
This commit replaces usage of vsi->back in i40e_print_link_message() (which is actually a PF pointer) with temp variable. Signed-off-by: NSudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Sudheer Mogilappagari 提交于
i40e_print_link_message() is intended to compare new link state with current link state and print log message only if the new state is different from current state. However in current driver the new state does not get updated when link is going down because of the if condition. When an interface is brought down, vsi->state is set to I40E_VSI_DOWN in i40e_vsi_close() and later i40e_print_link_message() does not get invoked in i40e_link_event due to if condition. Hence link down message doesn't appear when link is going down. The down state is seen later during i40e_open() and old state gets printed. The actual link state doesn't get updated in i40e_close() or i40e_open() but when i40e_handle_link_event is called inside i40e_clean_adminq_subtask. This change allows i40e_print_link_message() to be called when interface is going down and keeps the state information updated. Signed-off-by: NSudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Sudheer Mogilappagari 提交于
In current driver, when ifconfig ethx up is done, the link state doesn't transition to UP inside i40e_open(). It changes after AQ command response is handled in i40e_handle_link_event(). When pf->hw.phy.link_info.link_info is DOWN inside i40e_open(), The state is transient and invalid. So log message gets printed based on incorrect info (i.e link_info and an_info). This commit removes check for unqualified module inside i40e_up_complete(). The existing check in i40e_handle_link_event() logs the error message based on correct link state information. Signed-off-by: NSudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
This value is not calculating bytes_per_int, which would actually just be bytes/ITR_COUNTDOWN_START, but rather it's calculating bytes/usecs. Rename the variable for clarity so that future developers understand what the value is actually calculating. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 27 9月, 2017 2 次提交
-
-
由 Daniel Borkmann 提交于
Implement support for transferring XDP meta data into skb for ixgbe driver; before calling into the program, xdp.data_meta points to xdp.data, where on program return with pass verdict, we call into skb_metadata_set(). We implement this for the default ixgbe_build_skb() variant. For the ixgbe_construct_skb() that is used when legacy-rx buffer mananagement mode is turned on via ethtool, I found that XDP gets 0 headroom, so neither xdp_adjust_head() nor xdp_adjust_meta() can be used with this. Just add a comment with explanation for this operating mode. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This work enables generic transfer of metadata from XDP into skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper bpf_xdp_adjust_meta() for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for meta data, followed by the actual packet data. struct xdp_buff is therefore laid out that we first point to data_hard_start, then data_meta directly prepended to data followed by data_end marking the end of packet. bpf_xdp_adjust_head() takes into account whether we have meta data already prepended and if so, memmove()s this along with the given offset provided there's enough room. xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as DoS filter), we can push further meta data along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with skb there, we can also set skb->mark, skb->priority or other skb meta data out of BPF, thus having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into skb (it's also more efficient as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation. The scratch space at the head of the packet can be multiple of 4 byte up to 32 byte large. Drivers not yet supporting xdp->data_meta can simply be set up with xdp->data_meta as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail. The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from it's original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons though. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 9月, 2017 4 次提交
-
-
由 Thomas Meyer 提交于
Use *_pool_zalloc rather than *_pool_alloc followed by memset with 0. Found by coccinelle spatch "api/alloc/pool_zalloc-simple.cocci" Signed-off-by: NThomas Meyer <thomas@m3y3r.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Allen Pais 提交于
Use setup_timer function instead of initializing timer with the function and data fields. Signed-off-by: NAllen Pais <allen.lkml@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> -
由 Allen Pais 提交于
Use setup_timer function instead of initializing timer with the function and data fields. Signed-off-by: NAllen Pais <allen.lkml@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> -
由 Allen Pais 提交于
Use setup_timer function instead of initializing timer with the function and data fields. Signed-off-by: NAllen Pais <allen.lkml@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 9月, 2017 2 次提交
-
-
由 Jacob Keller 提交于
When introducing the functions to read the NVM through the AdminQ, we did not correctly mark the wb_desc. Fixes: 7073f46e ("i40e: Add AQ commands for NVM Update for X722", 2015-06-05) Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Anjali Singhai Jain 提交于
X722 devices use the AdminQ to access the NVM, and this requires taking the AdminQ lock. Because of this, we lock the AdminQ during i40e_read_nvm(), which is also called in places where the lock is already held, such as the firmware update path which wants to lock once and then unlock when finished after performing several tasks. Although this should have only affected X722 devices, commit 96a39aed ("i40e: Acquire NVM lock before reads on all devices", 2016-12-02) added locking for all NVM reads, regardless of device family. This resulted in us accidentally causing NVM acquire timeouts on all devices, causing failed firmware updates which left the eeprom in a corrupt state. Create unsafe non-locked variants of i40e_read_nvm_word and i40e_read_nvm_buffer, __i40e_read_nvm_word and __i40e_read_nvm_buffer respectively. These variants will not take the NVM lock and are expected to only be called in places where the NVM lock is already held if needed. Since the only caller of i40e_read_nvm_buffer() was in such a path, remove it entirely in favor of the unsafe version. If necessary we can always add it back in the future. Additionally, we now need to hold the NVM lock in i40e_validate_checksum because the call to i40e_calc_nvm_checksum now assumes that the NVM lock is held. We can further move the call to read I40E_SR_SW_CHECKSUM_WORD up a bit so that we do not need to acquire the NVM lock twice. This should resolve firmware updates and also fix potential raise that could have caused the driver to report an invalid NVM checksum upon driver load. Reported-by: NStefan Assmann <sassmann@kpanic.de> Fixes: 96a39aed ("i40e: Acquire NVM lock before reads on all devices", 2016-12-02) Signed-off-by: NAnjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 28 8月, 2017 4 次提交
-
-
由 Jacob Keller 提交于
The dynamic ITR algorithm depends on a calculation of usecs which assumes that the interrupts have been firing constantly at the interrupt throttle rate. This is not guaranteed because we could have a low packet rate, or have been polling in software. We'll estimate whether this is the case by using jiffies to determine if we've been too long. If the time difference of jiffies is larger we are guaranteed to have an incorrect calculation. If the time difference of jiffies is smaller we might have been polling some but the difference shouldn't affect the calculation too much. This ensures that we don't get stuck in BULK latency during certain rare situations where we receive bursts of packets that force us into NAPI polling. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
Since commit c56625d5 ("i40e/i40evf: change dynamic interrupt thresholds") a new higher latency ITR setting called I40E_ULTRA_LATENCY was added with a cryptic comment about how it was meant for adjusting Rx more aggressively when streaming small packets. This mode was attempting to calculate packets per second and then kick in when we have a huge number of small packets. Unfortunately, the ULTRA setting was kicking in for workloads it wasn't intended for including single-thread UDP_STREAM workloads. This wasn't caught for a variety of reasons. First, the ip_defrag routines were improved somewhat which makes the UDP_STREAM test still reasonable at 10GbE, even when dropped down to 8k interrupts a second. Additionally, some other obvious workloads appear to work fine, such as TCP_STREAM. The number 40k doesn't make sense for a number of reasons. First, we absolutely can do more than 40k packets per second. Second, we calculate the value inline in an integer, which sometimes can overflow resulting in using incorrect values. If we fix this overflow it makes it even more likely that we'll enter ULTRA mode which is the opposite of what we want. The ULTRA mode was added originally as a way to reduce CPU utilization during a small packet workload where we weren't keeping up anyways. It should never have been kicking in during these other workloads. Given the issues outlined above, let's remove the ULTRA latency mode. If necessary, a better solution to the CPU utilization issue for small packet workloads will be added in a future patch. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
In commit 96db776a ("i40e/vf: fix interrupt affinity bug") we added some code to force exit of polling in case we did not have the correct CPU. This is important since it was possible for the IRQ affinity to be changed while the CPU is pegged at 100%. This can result in the polling routine being stuck on the wrong CPU until traffic finally stops. Unfortunately, the implementation, "if the CPU is correct, exit as normal, otherwise, fall-through to the end-polling exit" is incredibly confusing to reason about. In this case, the normal flow looks like the exception, while the exception actually occurs far away from the if statement and comment. We recently discovered and fixed a bug in this code because we were incorrectly initializing the affinity mask. Re-write the code so that the exceptional case is handled at the check, rather than having the logic be spread through the regular exit flow. This does end up with minor code duplication, but the resulting code is much easier to reason about. The new logic is identical, but inverted. If we are running on a CPU not in our affinity mask, we'll exit polling. However, the code flow is much easier to understand. Note that we don't actually have to check for MSI-X, because in the MSI case we'll only have one q_vector, but its default affinity mask should be correct as it includes all CPUs when it's initialized. Further, we could at some point add code to setup the notifier for the non-MSI-X case and enable this workaround for that case too, if desired, though there isn't much gain since its unlikely to be the common case. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Jacob Keller 提交于
On older kernels a call to irq_set_affinity_hint does not guarantee that the IRQ affinity will be set. If nothing else on the system sets the IRQ affinity this can result in a bug in the i40e_napi_poll() routine where we notice that our interrupt fired on the "wrong" CPU according to our internal affinity_mask variable. This results in a bug where we continuously tell NAPI to stop polling to move the interrupt to a new CPU, but the CPU never changes because our affinity mask does not match the actual mask setup for the IRQ. The root problem is a mismatched affinity mask value. So lets initialize the value to cpu_possible_mask instead. This ensures that prior to the first time we get an IRQ affinity notification we'll have the mask set to include every possible CPU. We use cpu_possible_mask instead of cpu_online_mask since the former is almost certainly never going to change, while the later might change after we've made a copy. Signed-off-by: NJacob Keller <jacob.e.keller@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-