- 25 May 2018, 1 commit
-
-
By Jesper Dangaard Brouer

This patch changes the API for ndo_xdp_xmit to support bulking xdp_frames.

When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also by the net_device->ndo_xdp_xmit() call.

Benchmarking this patch with CONFIG_RETPOLINE, using xdp_redirect_map with a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames available as a bulk inside the driver's ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE shows the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per-frame producer locking and instead amortize the locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
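As a rough illustration of what the bulked ndo buys, here is a minimal driver-side sketch, not the exact upstream code (helper names follow the i40e driver of this era; error handling is simplified): the TX ring, and any producer lock, is taken once for the whole batch instead of once per frame.

```c
/* Hedged sketch of a bulk ndo_xdp_xmit implementation */
static int xdp_xmit_bulk(struct i40e_ring *xdp_ring, int n,
			 struct xdp_frame **frames)
{
	int i, drops = 0;

	for (i = 0; i < n; i++) {
		/* i40e_xmit_xdp_ring() is the per-frame TX helper */
		if (i40e_xmit_xdp_ring(frames[i], xdp_ring) != I40E_XDP_TX) {
			xdp_return_frame(frames[i]);
			drops++;
		}
	}
	return n - drops;	/* frames actually queued */
}
```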
-
- 01 May 2018, 1 commit
-
-
By Jacob Keller
Recent versions of the Linux kernel now warn about incorrect parameter definitions for function comments. Fix up several function comments to correctly reflect the current function arguments. This cleans up the warnings and helps ensure our documentation is accurate.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
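For reference, the kernel-doc form these warnings enforce looks like the following (a hypothetical helper, purely illustrative): every @name must match an actual argument.

```c
/**
 * i40e_example_clean - hypothetical helper showing kernel-doc form
 * @ring: ring to clean
 * @budget: maximum number of packets to process
 *
 * Each @param above must name a real parameter below, or a W=1 build
 * of recent kernels emits "Function parameter ... not described" /
 * "Excess function parameter" warnings.
 *
 * Returns the number of packets processed.
 **/
static int i40e_example_clean(struct i40e_ring *ring, int budget);
```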
-
- 28 April 2018, 1 commit
-
-
By Jeff Kirsher

After many years of having a ~30 line copyright and license header on our source files, we are finally able to reduce that to one line with the advent of the SPDX identifier. Also caught a few files missing the SPDX license identifier, so fixed them up.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
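The one-line replacement looks like this; per Documentation/process/license-rules.rst, .c files use the //-comment form as the very first line, while headers use the /* */ form.

```c
// SPDX-License-Identifier: GPL-2.0
```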
-
- 17 April 2018, 3 commits
-
-
By Jesper Dangaard Brouer

Change the API for ndo_xdp_xmit to take a struct xdp_frame instead of a struct xdp_buff. This brings xdp_return_frame and ndo_xdp_xmit in sync, and builds towards changing the API further into a bulk API, because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
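A sketch of what a redirect caller looks like after this change (helper name as of this kernel era; later kernels renamed it xdp_convert_buff_to_frame()): the frame state lives in the packet headroom, so the resulting xdp_frame can be queued without any extra allocation.

```c
/* convert the live xdp_buff into a queue-able xdp_frame */
struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);

if (unlikely(!xdpf))
	return -EOVERFLOW;	/* not enough headroom for frame metadata */

/* the ndo now takes the frame; at this point in history, one per call */
err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
```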
-
By Jesper Dangaard Brouer

Changing the API xdp_return_frame() to take struct xdp_frame as its argument seems like a natural choice, but there are some subtle performance details here that need extra care, and this is a deliberate choice.

When an xdp_frame is dereferenced on a remote CPU during DMA-TX completion, the cache line changes to the "Shared" state. Later, when the page is reused for RX, this xdp_frame cache line is written, which changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache lines (with pointers) between CPUs. Thus, the only option is to dereference the xdp_frame.

It is only the ixgbe driver that had an optimization by which it could avoid dereferencing the xdp_frame. The driver already has a TX-ring queue, which (in the case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid dereferencing the xdp_frame.

To compensate for this, a prefetchw is used to tell the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate in the ixgbe driver.

V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
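The compensation described above is essentially one line on the RX side, where the xdp_frame will be written into the packet headroom (placement here is illustrative, based on my reading of the ixgbe change):

```c
/* the xdp_frame is about to be written at the start of the headroom;
 * warm the cache line for writing before running the XDP program */
prefetchw(xdp->data_hard_start);
```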
-
By Jesper Dangaard Brouer

Also convert driver i40e, which very recently got XDP_REDIRECT support in commit d9314c47 ("i40e: add support for XDP_REDIRECT").

V7: This patch got added in V7 of this patchset.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 March 2018, 3 commits
-
-
By Björn Töpel

The driver now acts upon the XDP_REDIRECT return action. Two new ndos are implemented: ndo_xdp_xmit and ndo_xdp_flush. The XDP_REDIRECT action enables an XDP program to redirect frames to other netdevs.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Björn Töpel

This commit tweaks the page counting for XDP_REDIRECT to function properly. XDP_REDIRECT support will be added in a future commit.

The current page counting scheme assumes that the reference count cannot decrease until the received frame has been sent to the upper layers of the networking stack. This assumption does not hold for the XDP_REDIRECT action, since a page (pointed out by xdp_buff) can have its reference count decreased via the xdp_do_redirect call. To work around that, we now start off with a large page count and then never allow the refcount to drop below two.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
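A minimal sketch of that scheme (struct and helper names are illustrative of the pattern, not a copy of the driver): take a large refcount up front, track the driver's share in a bias counter, and top it up long before it could ever reach one.

```c
struct rx_buf_sketch {
	struct page *page;
	u16 pagecnt_bias;	/* references the driver still owns */
};

static void rx_page_charge(struct rx_buf_sketch *bi)
{
	/* a big up-front count means a redirect dropping references
	 * can never race the page down to a refcount of zero */
	page_ref_add(bi->page, USHRT_MAX - 1);
	bi->pagecnt_bias = USHRT_MAX;
}

static void rx_page_refresh(struct rx_buf_sketch *bi)
{
	/* never let the driver's share fall to one; refcount minus
	 * bias is preserved by the recharge */
	if (unlikely(bi->pagecnt_bias == 1))
		rx_page_charge(bi);
}
```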
-
By Jacob Keller

The two Flow Director auto-disable flags are used at run time to mark when the flow director features need to be disabled. Thus the flags can change even when the RTNL lock is not held. They also have some code constructions which really should be test_and_set or test_and_clear using atomic bit operations.

Create new state fields to mark this, and stop including them in pf->flags. This is part of a larger effort to remove the need for cmpxchg64 in i40e_set_priv_flags().

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
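The pattern this enables, as a hedged sketch (the bit name follows my reading of the change and may differ; i40e_reenable_atr() is a hypothetical helper name): runtime flags live in pf->state so they can be flipped atomically without holding RTNL.

```c
/* auto-disable path: set the bit, act only on the 0 -> 1 transition */
if (!test_and_set_bit(__I40E_FD_ATR_AUTO_DISABLED, pf->state))
	dev_info(&pf->pdev->dev, "ATR auto-disabled\n");

/* re-enable path: clear the bit, act only if it was actually set */
if (test_and_clear_bit(__I40E_FD_ATR_AUTO_DISABLED, pf->state))
	i40e_reenable_atr(pf);
```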
-
- 24 March 2018, 1 commit
-
-
By Jeff Kirsher
Add the SPDX identifiers to all the Intel wired LAN driver files, as outlined in Documentation/process/license-rules.rst.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 February 2018, 1 commit
-
-
By Alan Brady

The i40e_detect_recover_hung function uses the i40e_get_tx_pending function to determine if there are packets stalled on the ring. i40e_get_tx_pending calculates the pending packets using the head writeback value and the HW tail. If the queue is stopped and we lose the interrupt that would update our next_to_clean, then a) we won't get another interrupt to clean, because the queue is stopped, and b) we won't catch the problem with i40e_detect_recover_hung, because the HW values make it look like there are no packets waiting to be transmitted.

Using the SW values instead, we can catch the issue, because next_to_clean will be out of sync with the head writeback. This has the added benefit of being less CPU intensive, because we don't need to reach into the hardware to get the values.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
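A hedged sketch of the software-side pending count (field names per the driver; the wrap-around handling is the point):

```c
/* pending descriptors computed purely from software ring state */
static u32 i40e_tx_pending_sw(struct i40e_ring *ring)
{
	u32 head = ring->next_to_clean;
	u32 tail = ring->next_to_use;

	if (head != tail)
		return head < tail ? tail - head
				   : tail + ring->count - head;
	return 0;
}
```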
-
- 13 February 2018, 7 commits
-
-
By Alexander Duyck

This patch replaces the existing mechanism for determining the correct value to program for adaptive ITR with yet another new and more complicated approach. The basic idea, from a 30K-foot view, is that this new approach pushes the Rx interrupt moderation up so that by default it starts in low latency and is gradually pushed up into a higher-latency setup, as long as doing so increases the number of packets processed. If the number of packets drops to between 4 and 1 per interrupt, we reset and just base our ITR on the size of the packets being received.

For Tx we leave it floating at a high interrupt delay and do not pull it down unless we start processing more than 112 packets per interrupt. If we start exceeding that, we cut our interrupt rates in half until we are back below 112.

The side effect of these patches is that we will be processing more packets per interrupt. This is both a good and a bad thing: it means we will not be blocking processing in the case of things like pktgen and XDP, but we will also be consuming a bit more CPU in cases such as network throughput tests using netperf.

One delta from the ixgbe version of the changes is that I have made the interrupt moderation a bit more aggressive when we are in bulk mode, by moving our "goldilocks zone" up from 48-96 to 56-112. The main motivation behind moving this is that we need to update less frequently and have more fine-grained control, due to the separate Tx and Rx ITR times.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
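A deliberately simplified sketch of the policy described above; only the thresholds 4 and 112 come from the commit text, while the step size, size buckets and helper names are purely illustrative.

```c
#define ITR_STEP_USECS	2	/* illustrative step */

/* illustrative: pick a latency target from average packet size */
static u16 itr_for_packet_size(u32 avg_bytes)
{
	return avg_bytes < 128 ? 10 : 50;	/* made-up buckets */
}

/* Rx: start low-latency, relax while batching keeps improving */
static u16 next_rx_itr(u16 itr, u32 pkts, u32 prev_pkts, u32 avg_bytes)
{
	if (pkts <= 4)			/* rate collapsed: size-based ITR */
		return itr_for_packet_size(avg_bytes);
	if (pkts > prev_pkts)		/* more delay -> more packets */
		return itr + ITR_STEP_USECS;
	return itr;			/* plateau: hold */
}

/* Tx: float high; past 112 packets/interrupt, halve the interrupt rate */
static u16 next_tx_itr(u16 itr, u32 pkts)
{
	return pkts > 112 ? itr * 2 : itr;
}
```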
-
By Alexander Duyck

This patch is mostly prep work for replacing the current approach to programming the dynamic, aka adaptive, ITR. Specifically, what we are doing here is splitting each of the Tx and Rx ITR into two separate values. The first value, current_itr, represents the current value of the register. The second value, target_itr, represents the desired value of the register.

The general plan is that this allows us to defer the update of the ITR value under certain circumstances. For now we will work with what we have, but in the future I hope to change the behavior so that we always update only one ITR at a time, using some simple logic to determine which ITR requires an update.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Alexander Duyck

Instead of using the register value for the defines when setting up the ring ITR, we can just use the actual values and avoid the use of shifts and macros to translate between the values we have and the values we want. This helps make the code more readable, as we can quickly translate from one value to the other.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Alexander Duyck

The CLEARPBA bit in the dynamic interrupt control register actually has no effect either way on the hardware. As per errata 28 in the XL710 specification update, the interrupt is actually cleared any time the register is written with the INTENA_MSK bit set to 0. As such, the act of toggling the enable bit will actually trigger the interrupt being cleared, and could lead to potentially lost events if auto-masking is not enabled.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Alexander Duyck

The logic for the dynamic ITR update is confusing at best, as there were odd paths chosen for how to find the rings associated with a given queue based on the vector index, and other inconsistencies throughout the code. This patch is an attempt to clean up the logic so that we can more easily understand what is going on.

Specifically, if there is a Rx or Tx ring that is enabled in dynamic mode on the q_vector, it is allowed to override the other side of the interrupt moderation. While that isn't correct, all this patch does is clean up the logic for now, so that when we come through and fix it we can more easily identify that this is wrong.

The other big change made here is that we replace references to vsi->rx_rings[q_vector->v_idx]->itr_setting with q_vector->rx.ring->itr_setting, as shown in the sketch after this entry. The general idea is that we can avoid the long pointer chase, since accessing q_vector->rx.ring is a single pointer access, versus having to chase down vsi->rx_rings, find the pointer in the array, and finally chase down the itr_setting from there.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
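The two access patterns side by side (taken from the commit text above):

```c
/* before: three dependent loads plus an array index on the hot path */
itr = vsi->rx_rings[q_vector->v_idx]->itr_setting;

/* after: the q_vector already caches its ring pointer */
itr = q_vector->rx.ring->itr_setting;
```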
-
By Alexander Duyck

The rings are already split out into Tx and Rx rings, so it doesn't make sense to have any single ring store both a Tx and an Rx itr_setting value. Since that is the case, drop the pair in favor of storing just a single ITR value.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Alan Brady
'bufer' should be 'buffer'.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 27 January 2018, 1 commit
-
-
By Alexander Duyck

The drivers for i40e and i40evf had a reg_idx value stored in the q_vector that was going completely unused. I can only assume this was copied over from ixgbe and nobody knew how to use it. I'm going to make use of the value to avoid having to compute the vector, and thus the register index, in multiple paths throughout the drivers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 24 January 2018, 1 commit
-
-
By Sudheer Mogilappagari

In VFs, there is a known issue which can cause writebacks to not occur when interrupts are disabled and there are fewer than 4 descriptors pending, resulting in a TX timeout. A timeout can also occur due to a lost interrupt.

The current implementation for detecting and recovering from hung queues in the PF is problematic because it actively encourages lost interrupts. By triggering a SW interrupt, interrupts are forced on. If we are already in napi_poll and an interrupt fires, napi_poll will not be rescheduled and the interrupt is effectively lost, thereby potentially *causing* hung queues.

This patch checks whether packets are being processed between watchdog cycles, determines the potentially hung queue, and triggers a SW interrupt only for that particular queue.

Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
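Firing a software interrupt at a single queue's vector looks roughly like this (register and mask names follow the i40e register header; treat the exact bit combination and vector addressing as a sketch):

```c
/* trigger a SW interrupt on one MSI-X vector so its queue gets cleaned */
u32 val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
	  I40E_PFINT_DYN_CTLN_ITR_INDX_MASK |		/* no ITR update */
	  I40E_PFINT_DYN_CTLN_SWINT_TRIG_MASK |
	  I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_MASK;

wr32(&pf->hw, I40E_PFINT_DYN_CTLN(q_vector->reg_idx), val);
```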
-
- 06 January 2018, 1 commit
-
-
By Jesper Dangaard Brouer

The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is a sideband channel for configuring/updating the flow director tables. This VSI type does not invoke XDP/eBPF code.

As suggested by Björn (V2): Instead of marking this I40E_VSI_FDIR RX-ring as a special case, reverse the logic and only register xdp_rxq_info's for RX-rings of type I40E_VSI_MAIN.

Driver hook points for xdp_rxq_info:
 * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
 * unreg: i40e_free_rx_resources (via i40e_vsi_free_rx_resources)

Tested on actual hardware with a samples/bpf program.

V2: Fixed bug in i40e_set_ringparam (memset zero) + match on I40E_VSI_MAIN.
V4: Update patch desc that got out-of-sync with code.

Cc: intel-wired-lan@lists.osuosl.org
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
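The two hook points, sketched (API as of this kernel era; later kernels added a napi_id argument to xdp_rxq_info_reg()):

```c
/* in i40e_setup_rx_descriptors(): register only for the main VSI */
if (rx_ring->vsi->type == I40E_VSI_MAIN)
	xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
			 rx_ring->queue_index);

/* in i40e_free_rx_resources(): unregister if we registered */
if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
	xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
```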
-
- 04 January 2018, 1 commit
-
-
By Alexander Duyck

The original code for __i40e_chk_linearize didn't take into account the fact that if a fragment is 16K in size or larger, it has to be split over 2 descriptors, and the smaller of those 2 descriptors will be on the trailing edge of the transmit. As a result we could get into situations where we didn't catch requests that could result in a Tx hang.

This patch takes care of that by subtracting the length of all but the trailing edge of the stale fragment before we test the sum. By doing this we can guarantee that we have all cases covered, including the case of a fragment that spans multiple descriptors. We don't need to worry about checking the inner portions of this, since 12K is the maximum aligned DMA size, and that is larger than any MSS will ever be, since the MTU limit for jumbo frames is on the order of 9K.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 22 November 2017, 1 commit
-
-
By Brian King

The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with i40e as well, as the code is very similar. read_barrier_depends is not sufficient to ensure that loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
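The fixed pattern in the Tx cleanup loop looks roughly like this (a sketch of the ordering, with descriptor field names per the driver): the full read barrier keeps the DD-bit load from being speculated ahead of the next_to_watch load.

```c
/* inside the i40e_clean_tx_irq() loop */
struct i40e_tx_desc *eop_desc = tx_buf->next_to_watch;

if (!eop_desc)
	break;

/* smp_rmb(), not read_barrier_depends(): prevent any other reads
 * prior to this one */
smp_rmb();

if (!(eop_desc->cmd_type_offset_bsz &
      cpu_to_le64(I40E_TX_DESC_DTYPE_DESC_DONE)))
	break;
```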
-
- 01 November 2017, 1 commit
-
-
By Alexander Duyck

This reverts commit 11f29003. I am reverting this as I am fairly certain it can result in a memory leak when combined with the current page recycling scheme. Specifically, we end up attempting to allocate fewer buffers than we recycled, and this results in us rewinding the next-to-alloc pointer, which leads to leaks when we overwrite the rx_buffer_info while processing the next frame.

Fixes: 11f29003 ("i40e/i40evf: bump tail only in multiples of 8")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 26 October 2017, 2 commits
-
-
By Alexander Duyck

This patch updates the i40e driver to include programming descriptors in the cleaned_count. Without this change it becomes possible for us to leak memory, as we don't trigger a large enough allocation when the time comes to allocate new buffers, and we end up overwriting a number of rx_buffers equal to the number of programming descriptors we encountered.

Fixes: 0e626ff7 ("i40e: Fix support for flow director programming status")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Alexander Duyck

It looks like there was either a copy/paste error or just a typo that resulted in the Tx ITR setting being used to determine whether we were using adaptive Rx interrupt moderation or not. This patch fixes the typo.

Fixes: 65e87c03 ("i40evf: support queue-specific settings for interrupt moderation")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 10 October 2017, 3 commits
-
-
By Alexander Duyck

It looks like we weren't correctly placing the pages from buffers that had been used to return a filter programming status back on the ring. As a result they were being overwritten and tracking of the pages was lost. This change works to correct that by incorporating part of i40e_put_rx_buffer into the programming status handler code. As a result we should now be correctly placing the pages for those buffers on the re-allocation list, instead of letting them stay in place.

Fixes: 0e626ff7 ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Jacob Keller

Hardware only fetches descriptors on cachelines of 8, essentially ignoring the lower 3 bits of the tail register. Thus, it is pointless to bump tail by an unaligned value, as the hardware will ignore some of the new descriptors we allocated. It is ideal if we can ensure tail writes are always aligned to 8.

At first, it seems like we'd already do this, since we allocate descriptors in batches which are a multiple of 8. Since we'd always increment by a multiple of 8, it seems like the value should always be aligned. However, this ignores allocation failures. If we fail to allocate a buffer, our tail register will become unaligned. Once it has become unaligned, it will essentially be stuck unaligned until a buffer allocation happens to fail at the exact amount necessary to re-align it.

We can do better by simply rounding down the number of buffers we're about to allocate (cleaned_count) such that "next_to_clean + cleaned_count" is rounded to the nearest multiple of 8. We do this by calculating how far off that value is and subtracting it from the cleaned_count. This essentially defers allocation of buffers if they're going to be ignored by hardware anyway, and re-aligns our next_to_use and tail values after a failure to allocate a descriptor.

This calculation ensures that we always align the tail writes in a way the hardware expects and don't unnecessarily allocate buffers which won't be fetched immediately.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
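The rounding itself is essentially one line (a sketch; note this change was later reverted, see the 01 November 2017 entry above):

```c
/* round down so next_to_clean + cleaned_count is a multiple of 8,
 * deferring buffers the hardware would ignore anyway */
cleaned_count -= (rx_ring->next_to_clean + cleaned_count) & 0x7;
```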
-
By Jacob Keller

In the past we changed driver behavior to not clear the PBA when re-enabling interrupts. This change was motivated by the flawed belief that clearing the PBA would cause a lost interrupt if a receive interrupt occurred while interrupts were disabled. According to empirical testing this isn't the case. Additionally, the data sheet specifically says that we should set the CLEARPBA bit when re-enabling interrupts in a polling setup.

This reverts commit 40d72a50 ("i40e/i40evf: don't lose interrupts")

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 06 October 2017, 1 commit
-
-
By Jacob Keller

Since commit 6a7fded7 ("i40e: Fix RS bit update in Tx path and disable force WB workaround") we've tried to "optimize" setting the RS bit based around skb->xmit_more. This same logic was refactored in commit 1dc8b538 ("i40e: Reorder logic for coalescing RS bits"), but ultimately was not functionally changed.

Using skb->xmit_more in this way is incorrect, because in certain circumstances we may see a large number of skbs in sequence with xmit_more set. This leads to a performance loss, as the hardware does not write back anything for those packets, which delays the time it takes for us to respond to the stack's transmit requests. This significantly impacts UDP performance, especially when layered with multiple devices, such as bonding, VLANs, and vnet setups.

This was not noticed until now because it is difficult to create a setup which reproduces the issue. It was discovered in a UDP_STREAM test in a VM, connected using a vnet device to a bridge, which is connected to a bonded pair of X710 ports in active-backup mode with a VLAN. These layered devices seem to compound the number of skbs transmitted at once by the qdisc. Additionally, the problem can be masked by reducing the ITR value.

Since the original commit does not provide strong justification for this RS bit "optimization", revert to the previous behavior of setting the RS bit every 4th packet.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 30 September 2017, 1 commit
-
-
By Jacob Keller
This value is not calculating bytes_per_int, which would actually just be bytes/ITR_COUNTDOWN_START, but rather it's calculating bytes/usecs. Rename the variable for clarity so that future developers understand what the value is actually calculating.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 27 September 2017, 1 commit
-
-
By Daniel Borkmann

This work enables generic transfer of metadata from XDP into the skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper, bpf_xdp_adjust_meta(), for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for metadata, followed by the actual packet data. struct xdp_buff is therefore laid out such that we first point to data_hard_start, then to data_meta, directly prepended to data, followed by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into account whether we have metadata already prepended, and if so, memmove()s this along with the given offset, provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as a DoS filter), we can push further metadata along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with an skb there, we can also set skb->mark, skb->priority or other skb metadata out of BPF. Having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into the skb (it's also more efficient, as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation.

The scratch space at the head of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not yet supporting xdp->data_meta can simply set up xdp->data_meta as xdp->data + 1, as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from its original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons, though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
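To make the layout concrete, here is a minimal XDP program using the new helper (headers per modern libbpf; the section name and the 0xdeadbeef value are arbitrary): it reserves 4 bytes of metadata in front of the packet that a clsact ingress program can later read via __sk_buff->data_meta.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_set_meta(struct xdp_md *ctx)
{
	void *data;
	__u32 *meta;

	/* grow the metadata area by 4 bytes (negative delta expands it) */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;	/* no headroom or unsupported driver */

	meta = (void *)(long)ctx->data_meta;
	data = (void *)(long)ctx->data;
	if ((void *)(meta + 1) > data)	/* bounds check for the verifier */
		return XDP_PASS;

	*meta = 0xdeadbeef;		/* scratch value for tc to pick up */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```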
-
- 28 August 2017, 4 commits
-
-
By Jacob Keller

The dynamic ITR algorithm depends on a calculation of usecs which assumes that the interrupts have been firing constantly at the interrupt throttle rate. This is not guaranteed, because we could have a low packet rate, or have been polling in software.

We'll estimate whether this is the case by using jiffies to determine whether it has been too long since the last update. If the jiffies difference is large, we are guaranteed to have an incorrect calculation. If the jiffies difference is small, we might have been polling some, but the difference shouldn't affect the calculation too much.

This ensures that we don't get stuck in BULK latency during certain rare situations where we receive bursts of packets that force us into NAPI polling.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Jacob Keller

Since commit c56625d5 ("i40e/i40evf: change dynamic interrupt thresholds") a new higher-latency ITR setting called I40E_ULTRA_LATENCY was added, with a cryptic comment about how it was meant for adjusting Rx more aggressively when streaming small packets. This mode was attempting to calculate packets per second and then kick in when we have a huge number of small packets.

Unfortunately, the ULTRA setting was kicking in for workloads it wasn't intended for, including single-thread UDP_STREAM workloads. This wasn't caught for a variety of reasons. First, the ip_defrag routines were improved somewhat, which makes the UDP_STREAM test still reasonable at 10GbE even when dropped down to 8k interrupts a second. Additionally, some other obvious workloads appear to work fine, such as TCP_STREAM.

The number 40k doesn't make sense for a number of reasons. First, we absolutely can do more than 40k packets per second. Second, we calculate the value inline in an integer, which sometimes can overflow, resulting in using incorrect values. If we fix this overflow, it makes it even more likely that we'll enter ULTRA mode, which is the opposite of what we want.

The ULTRA mode was added originally as a way to reduce CPU utilization during a small-packet workload where we weren't keeping up anyway. It should never have been kicking in during these other workloads. Given the issues outlined above, let's remove the ULTRA latency mode. If necessary, a better solution to the CPU utilization issue for small-packet workloads will be added in a future patch.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Jacob Keller

In commit 96db776a ("i40e/vf: fix interrupt affinity bug") we added some code to force exit of polling in case we did not have the correct CPU. This is important, since it was possible for the IRQ affinity to be changed while the CPU was pegged at 100%. This could result in the polling routine being stuck on the wrong CPU until traffic finally stopped.

Unfortunately, the implementation, "if the CPU is correct, exit as normal; otherwise, fall through to the end-polling exit", is incredibly confusing to reason about. In this case, the normal flow looks like the exception, while the exception actually occurs far away from the if statement and comment. We recently discovered and fixed a bug in this code because we were incorrectly initializing the affinity mask.

Re-write the code so that the exceptional case is handled at the check, rather than having the logic spread through the regular exit flow. This does end up with minor code duplication, but the resulting code is much easier to reason about. The new logic is identical, but inverted: if we are running on a CPU not in our affinity mask, we'll exit polling. However, the code flow is much easier to understand.

Note that we don't actually have to check for MSI-X, because in the MSI case we'll only have one q_vector, but its default affinity mask should be correct, as it includes all CPUs when it's initialized. Further, we could at some point add code to set up the notifier for the non-MSI-X case and enable this workaround for that case too, if desired, though there isn't much gain, since it's unlikely to be the common case.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
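The inverted check, roughly as it reads after this change (a sketch; i40e_force_wb() is the driver's existing software-interrupt helper):

```c
/* inside i40e_napi_poll(), before re-enabling the interrupt */
int cpu_id = smp_processor_id();

if (!cpumask_test_cpu(cpu_id, &q_vector->affinity_mask)) {
	/* done polling here; force the interrupt to fire again so the
	 * vector is picked up on a CPU inside the affinity mask */
	napi_complete_done(napi, work_done);
	i40e_force_wb(vsi, q_vector);
	return budget - 1;	/* returning less than budget stops polling */
}
```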
-
By Jacob Keller

If we don't have MSI-X enabled, we handle all interrupts on icr0. This is a special case, so let's move the conditional into i40e_update_enable_itr() in order to make i40e_napi_poll easier to read.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 02 August 2017, 1 commit
-
-
By Florian Fainelli

On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC, we should be seeing a lockdep splat indicating that this seqcount is not correctly initialized; fix that.

Fixes: 980e9b11 ("i40e: Add support for 64 bit netstats")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
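The fix is the standard one-liner for a u64_stats seqcount (a no-op on 64-bit builds, but required for 32-bit correctness and for lockdep):

```c
/* initialize the per-ring stats syncp before first use */
u64_stats_init(&ring->syncp);
```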
-
- 26 July 2017, 2 commits
-
-
By Jesse Brandeburg

The compiler reported several places where the driver compared signed and unsigned types. Cast or change the types to remove the warnings.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
By Jesse Brandeburg
This just reorders some local vars and makes the code flow clearer.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 21 June 2017, 1 commit
-
-
By Björn Töpel

This patch adds proper XDP_TX action support. For each Tx ring, an additional XDP Tx ring is allocated and set up. This version does the DMA mapping in the fast path, which will penalize performance on IOMMU-enabled systems. Further, debugfs support is not wired up for the XDP Tx rings.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
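The fast-path mapping mentioned above, sketched (descriptor fill elided; I40E_XDP_CONSUMED stands for the driver's drop result): mapping at transmit time keeps the RX path unchanged, but adds an IOMMU map per XDP_TX packet.

```c
/* map the frame for TX right in the hot path */
dma_addr_t dma = dma_map_single(xdp_ring->dev, xdp->data,
				xdp->data_end - xdp->data,
				DMA_TO_DEVICE);
if (dma_mapping_error(xdp_ring->dev, dma))
	return I40E_XDP_CONSUMED;	/* drop on mapping failure */

/* ... fill one TX descriptor with dma/len and bump the ring tail ... */
```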
-