1. 01 November 2017, 1 commit
  2. 26 October 2017, 2 commits
  3. 10 October 2017, 3 commits
    • i40e: Fix memory leak related filter programming status · 2b9478ff
      By Alexander Duyck
      It looks like we weren't correctly placing the pages from buffers that had
      been used to return a filter programming status back on the ring. As a
      result they were being overwritten and tracking of the pages was lost.
      
      This change works to correct that by incorporating part of
      i40e_put_rx_buffer into the programming status handler code. As a result we
      should now be correctly placing the pages for those buffers on the
      re-allocation list instead of letting them stay in place.
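
      The gist of the fix, as a trimmed sketch rather than the literal
      upstream diff (helper and field names follow the i40e Rx path of
      that era):

        /* after consuming a programming status descriptor, recycle the
         * page instead of leaving the buffer in place on the ring
         */
        struct i40e_rx_buffer *rx_buffer;

        rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];

        /* ... decode and handle the filter programming status ... */

        i40e_reuse_rx_page(rx_ring, rx_buffer);   /* back on the re-alloc list */
        rx_buffer->page = NULL;                   /* slot now needs a refill */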
      
      Fixes: 0e626ff7 ("i40e: Fix support for flow director programming status")
      Reported-by: Anders K. Pedersen <akp@cohaesio.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Anders K Pedersen <akp@cohaesio.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: bump tail only in multiples of 8 · 11f29003
      By Jacob Keller
      Hardware only fetches descriptors in cachelines of 8, essentially
      ignoring the lower 3 bits of the tail register. Thus, it is pointless to
      bump tail to an unaligned value, as the hardware will ignore some of the
      newly allocated descriptors. Ideally, we should ensure that tail writes
      are always aligned to 8.
      
      At first, it seems like we'd already do this, since we allocate
      descriptors in batches which are a multiple of 8. Since we'd always
      increment by a multiple of 8, it seems like the value should always be
      aligned.
      
      However, this ignores allocation failures. If we fail to allocate
      a buffer, our tail register will become unaligned. Once it has become
      unaligned it will essentially be stuck unaligned until a buffer
      allocation happens to fail at the exact amount necessary to re-align it.
      
      We can do better, by simply rounding down the number of buffers we're
      about to allocate (cleaned_count) such that "next_to_clean
      + cleaned_count" is rounded to the nearest multiple of 8.
      
      We do this by calculating how far off that value is and subtracting it
      from the cleaned_count. This essentially defers allocation of buffers if
      they're going to be ignored by the hardware anyway, and re-aligns our
      next_to_use and tail values after a failure to allocate a descriptor.
      
      This calculation ensures that we always align the tail writes in a way
      the hardware expects and don't unnecessarily allocate buffers which
      won't be fetched immediately.
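
      A minimal, self-contained illustration of the rounding (variable
      names are ours, not the driver's):

        #include <stdint.h>
        #include <stdio.h>

        /* Shrink the refill batch so that (index + batch) stays a
         * multiple of 8, deferring the leftover buffers to a later pass.
         */
        static uint16_t align_batch(uint16_t index, uint16_t batch)
        {
                return batch - ((index + batch) & 0x7);
        }

        int main(void)
        {
                /* an earlier allocation failure left the index at 13; a
                 * batch of 42 would move tail to 55 (unaligned), so only
                 * 35 buffers are refilled and tail lands on 48
                 */
                printf("%u\n", align_batch(13, 42));    /* prints 35 */
                return 0;
        }
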
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: always set the CLEARPBA flag when re-enabling interrupts · dbadbbe2
      By Jacob Keller
      In the past we changed driver behavior to not clear the PBA when
      re-enabling interrupts. This change was motivated by the flawed belief
      that clearing the PBA would cause a lost interrupt if a receive
      interrupt occurred while interrupts were disabled.
      
      According to empirical testing this isn't the case. Additionally, the
      data sheet specifically says that we should set the CLEARPBA bit when
      re-enabling interrupts in a polling setup.
      
      This reverts commit 40d72a50 ("i40e/i40evf: don't lose interrupts")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  4. 06 October 2017, 1 commit
    • i40e: ignore skb->xmit_more when deciding to set RS bit · a5340d93
      By Jacob Keller
      Since commit 6a7fded7 ("i40e: Fix RS bit update in Tx path and
      disable force WB workaround") we've tried to "optimize" setting the
      RS bit based around skb->xmit_more. This same logic was refactored
      in commit 1dc8b538 ("i40e: Reorder logic for coalescing RS bits"),
      but ultimately was not functionally changed.
      
      Using skb->xmit_more in this way is incorrect, because in certain
      circumstances we may see a large number of skbs in sequence with
      xmit_more set. This leads to a performance loss as the hardware does not
      writeback anything for those packets, which delays the time it takes for
      us to respond to the stack transmit requests. This significantly impacts
      UDP performance, especially when layered with multiple devices, such as
      bonding, VLANs, and vnet setups.
      
      This was not noticed until now because it is difficult to create a setup
      which reproduces the issue. It was discovered in a UDP_STREAM test in
      a VM, connected using a vnet device to a bridge, which is connected to
      a bonded pair of X710 ports in active-backup mode with a VLAN. These
      layered devices seem to compound the number of skbs transmitted at once
      by the qdisc. Additionally, the problem can be masked by reducing the
      ITR value.
      
      Since the original commit does not provide strong justification for this
      RS bit "optimization", revert to the previous behavior of setting the RS
      bit every 4th packet.
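
      A small, runnable model of the restored policy (constants and field
      names are illustrative, not the hardware's):

        #include <stdint.h>
        #include <stdio.h>

        #define WB_STRIDE 4           /* request a writeback every 4th packet */
        #define CMD_RS    (1u << 5)   /* stand-in for the descriptor RS bit */

        struct tx_ring { unsigned int packet_stride; };

        /* Decide whether this packet's last descriptor carries RS; note
         * that skb->xmit_more is no longer consulted at all.
         */
        static uint32_t rs_policy(struct tx_ring *ring, uint32_t td_cmd)
        {
                if (++ring->packet_stride >= WB_STRIDE) {
                        td_cmd |= CMD_RS;
                        ring->packet_stride = 0;
                }
                return td_cmd;
        }

        int main(void)
        {
                struct tx_ring ring = { 0 };

                for (int pkt = 1; pkt <= 8; pkt++)
                        printf("pkt %d: RS=%d\n", pkt,
                               !!(rs_policy(&ring, 0) & CMD_RS));
                return 0;        /* RS is set on packets 4 and 8 */
        }
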
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  5. 30 September 2017, 1 commit
  6. 27 September 2017, 1 commit
    • bpf: add meta pointer for direct access · de8f3a83
      By Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
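
      A minimal XDP program using the new helper (assumes libbpf-style
      headers; the program name and the stored value are only an example):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("xdp")
        int xdp_store_meta(struct xdp_md *ctx)
        {
                void *data, *data_meta;
                __u32 *meta;

                /* grow the metadata area by 4 bytes (negative delta) */
                if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32)))
                        return XDP_PASS;   /* driver may not support data_meta */

                data      = (void *)(long)ctx->data;
                data_meta = (void *)(long)ctx->data_meta;
                meta      = data_meta;

                /* the verifier requires this bounds check against data */
                if ((void *)(meta + 1) > data)
                        return XDP_PASS;

                *meta = 0x42;   /* e.g. a classification result for clsact */
                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";
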
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 28 August 2017, 4 commits
    • i40e/i40evf: avoid dynamic ITR updates when polling or low packet rate · 742c9875
      By Jacob Keller
      The dynamic ITR algorithm depends on a calculation of usecs which
      assumes that the interrupts have been firing constantly at the interrupt
      throttle rate. This is not guaranteed because we could have a low packet
      rate, or have been polling in software.
      
      We'll estimate whether this is the case by using jiffies to determine
      whether too much time has passed since the last update. If the jiffies
      difference is large, the calculation is guaranteed to be incorrect. If
      the jiffies difference is small, we might have been polling for part of
      the interval, but that shouldn't affect the calculation much.
      
      This ensures that we don't get stuck in BULK latency during certain rare
      situations where we receive bursts of packets that force us into NAPI
      polling.
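
      A userspace model of the staleness check (HZ and the one-second
      bound are illustrative; the driver's exact threshold may differ):

        #include <stdbool.h>
        #include <stdio.h>

        #define HZ 250   /* jiffies per second on this example config */

        /* If too many jiffies elapsed since the last update, interrupts
         * were clearly not firing at the throttle rate, so the usecs
         * estimate would be wrong and the dynamic ITR update is skipped.
         */
        static bool itr_sample_is_stale(unsigned long now, unsigned long last)
        {
                return (now - last) > HZ;
        }

        int main(void)
        {
                printf("%d\n", itr_sample_is_stale(1000, 500)); /* 1: stale */
                printf("%d\n", itr_sample_is_stale(1000, 900)); /* 0: fresh */
                return 0;
        }
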
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: remove ULTRA latency mode · 0a2c7722
      By Jacob Keller
      Since commit c56625d5 ("i40e/i40evf: change dynamic interrupt
      thresholds") a new higher latency ITR setting called I40E_ULTRA_LATENCY
      was added with a cryptic comment about how it was meant for adjusting Rx
      more aggressively when streaming small packets.
      
      This mode was attempting to calculate packets per second and then kick
      in when we have a huge number of small packets.
      
      Unfortunately, the ULTRA setting was kicking in for workloads it wasn't
      intended for, including single-threaded UDP_STREAM workloads.
      
      This wasn't caught for a variety of reasons. First, the ip_defrag
      routines were improved somewhat which makes the UDP_STREAM test still
      reasonable at 10GbE, even when dropped down to 8k interrupts a second.
      Additionally, some other obvious workloads appear to work fine, such
      as TCP_STREAM.
      
      The 40k packets-per-second threshold doesn't make sense for a number of
      reasons. First, we can absolutely do more than 40k packets per second.
      Second, we calculate the value inline in an integer, which can sometimes
      overflow, resulting in incorrect values.
      
      If we fix this overflow it makes it even more likely that we'll enter
      ULTRA mode which is the opposite of what we want.
      
      The ULTRA mode was added originally as a way to reduce CPU utilization
      during a small packet workload where we weren't keeping up anyway. It
      should never have been kicking in during these other workloads.
      
      Given the issues outlined above, let's remove the ULTRA latency mode. If
      necessary, a better solution to the CPU utilization issue for small
      packet workloads will be added in a future patch.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: invert logic for checking incorrect cpu vs irq affinity · 6d977729
      By Jacob Keller
      In commit 96db776a ("i40e/vf: fix interrupt affinity bug")
      we added some code to force exit of polling in case we did
      not have the correct CPU. This is important since it was possible for
      the IRQ affinity to be changed while the CPU is pegged at 100%. This can
      result in the polling routine being stuck on the wrong CPU until
      traffic finally stops.
      
      Unfortunately, the implementation, "if the CPU is correct, exit as
      normal, otherwise, fall-through to the end-polling exit" is incredibly
      confusing to reason about. In this case, the normal flow looks like the
      exception, while the exception actually occurs far away from the if
      statement and comment.
      
      We recently discovered and fixed a bug in this code because we were
      incorrectly initializing the affinity mask.
      
      Re-write the code so that the exceptional case is handled at the check,
      rather than having the logic be spread through the regular exit flow.
      This does end up with minor code duplication, but the resulting code is
      much easier to reason about.
      
      The new logic is identical, but inverted. If we are running on a CPU not
      in our affinity mask, we'll exit polling. However, the code flow is much
      easier to understand.
      
      Note that we don't actually have to check for MSI-X, because in the MSI
      case we'll only have one q_vector, but its default affinity mask should
      be correct as it includes all CPUs when it's initialized. Further, we
      could at some point add code to setup the notifier for the non-MSI-X
      case and enable this workaround for that case too, if desired, though
      there isn't much gain since it's unlikely to be the common case.
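
      A reduced model of the inverted flow (cpumask handling shrunk to a
      plain bitmask; names are ours):

        #include <stdio.h>

        enum poll_action { STOP_POLLING, KEEP_POLLING };

        /* Handle the exceptional case (running on a CPU outside the
         * affinity mask) up front instead of threading it through the
         * normal exit path.
         */
        static enum poll_action napi_poll_model(unsigned int cpu,
                                                unsigned long affinity_mask,
                                                int work_left)
        {
                if (!(affinity_mask & (1ul << cpu)))
                        return STOP_POLLING;   /* let the IRQ re-fire on a
                                                * CPU that is in the mask */

                return work_left ? KEEP_POLLING : STOP_POLLING;
        }

        int main(void)
        {
                /* affinity mask allows CPU 0 only */
                printf("%d\n", napi_poll_model(3, 0x1, 10)); /* 0: forced exit */
                printf("%d\n", napi_poll_model(0, 0x1, 10)); /* 1: keep polling */
                return 0;
        }
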
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: move enabling icr0 into i40e_update_enable_itr · 9254c0e3
      By Jacob Keller
      If we don't have MSI-X enabled, we handle all interrupts on icr0. This
      is a special case, so let's move the conditional into
      i40e_update_enable_itr() in order to make i40e_napi_poll easier to
      read.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  8. 02 August 2017, 1 commit
  9. 26 July 2017, 2 commits
  10. 21 June 2017, 2 commits
  11. 13 June 2017, 1 commit
    • i40e: fix handling of HW ATR eviction · 6964e53f
      By Jacob Keller
      A recent commit to refactor the driver and remove the hw_disabled_flags
      field accidentally introduced two regressions. First, we overwrote
      pf->flags which removed various key flags including the MSI-X settings.
      
      Additionally, it was intended that we now have two flags,
      HW_ATR_EVICT_CAPABLE and HW_ATR_EVICT_ENABLED, but this was not done,
      and we were accidentally misusing HW_ATR_EVICT_CAPABLE everywhere.
      
      This patch adds the missing piece, HW_ATR_EVICT_ENABLED, and safely
      updates pf->flags instead of overwriting it.
      
      Without this patch we will have many problems including disabling MSI-X
      support, and we'll attempt to use HW ATR eviction on devices which do
      not support it.
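
      A small, runnable model of the two-flag pattern (bit values are
      illustrative, not the driver's):

        #include <stdio.h>

        #define HW_ATR_EVICT_CAPABLE  (1u << 0)
        #define HW_ATR_EVICT_ENABLED  (1u << 1)

        int main(void)
        {
                unsigned int pf_flags = 0x80;   /* pretend MSI-X bit, etc. */

                /* record the capability without clobbering other flags */
                pf_flags |= HW_ATR_EVICT_CAPABLE;    /* not: pf_flags = ... */

                /* only enable the feature where the capability exists */
                if (pf_flags & HW_ATR_EVICT_CAPABLE)
                        pf_flags |= HW_ATR_EVICT_ENABLED;

                printf("use HW ATR eviction: %d\n",
                       !!(pf_flags & HW_ATR_EVICT_ENABLED));
                printf("other flags preserved: %d\n", !!(pf_flags & 0x80));
                return 0;
        }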
      
      Fixes: 47994c11 ("i40e: remove hw_disabled_flags in favor of using separate flag bits", 2017-04-19)
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 06 June 2017, 1 commit
  13. 31 May 2017, 3 commits
  14. 30 April 2017, 3 commits
    • i40e: remove hw_disabled_flags in favor of using separate flag bits · 47994c11
      By Jacob Keller
      The hw_disabled_flags field was added as a way of signifying that
      a feature was automatically or temporarily disabled. However, we
      actually only use this for FDir features. Replace its use with new
      _AUTO_DISABLED flags instead. This is more readable, because you aren't
      setting an *_ENABLED flag to *disable* the feature.
      
      Additionally, clean up a few areas where we used these bits. First, we
      don't really need to set the auto-disable flag for ATR if we're fully
      disabling the feature via ethtool.
      
      Second, we should always clear the auto-disable bits in case they somehow
      got set when the feature was disabled. However, avoid displaying
      a message that we've re-enabled the feature.
      
      Third, we shouldn't be re-enabling ATR in the SB ntuple add flow,
      because it might have been disabled due to space constraints. Instead,
      we should just wait for the fdir_check_and_reenable to be called by the
      watchdog.
      
      Overall, this change allows us to simplify some code by removing an
      extra field we didn't need, and the result should make it more clear as
      to what we're actually doing with these flags.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: use DECLARE_BITMAP for state fields · 0da36b97
      By Jacob Keller
      Instead of assuming our flags fit within an unsigned long, use
      DECLARE_BITMAP which will ensure that we always allocate enough space.
      Additionally, use __I40E_STATE_SIZE__ markers as the last element of the
      enumeration so that the size of the BITMAP is compile-time assigned
      rather than programmer-time assigned. This ensures that potential future
      flag additions do not actually overrun the array. This is especially
      important as 32bit systems would only have 32bit longs instead of 64bit
      longs as we generally have assumed in the prior code.
      
      This change also removes a dereference of the state fields throughout
      the code, so it does have a bit of code churn. The conversions were
      automated using sed replacements with an alternation
      
        s/&(vsi->back|vsi|pf)->state/\1->state/
        s/&adapter->vsi.state/adapter->vsi.state/
      
      For debugfs, we modify the printing so that we can display chunks of the
      state value on new lines. This ensures that we can print the entire set
      of state values. Additionally, we now print them as 08lx to ensure that
      they display nicely.
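
      A sketch of the pattern (kernel-style fragment; the state names here
      are examples, not the driver's full list):

        /* enum ends with a SIZE marker so the bitmap grows automatically
         * whenever new state bits are added
         */
        enum i40e_vsi_state {
                __I40E_VSI_DOWN,
                __I40E_VSI_NEEDS_RESTART,
                /* ... */
                __I40E_VSI_STATE_SIZE__,        /* must be last */
        };

        struct i40e_vsi {
                DECLARE_BITMAP(state, __I40E_VSI_STATE_SIZE__);
                /* ... */
        };

        /* usage: the bitmap is passed directly, no address-of needed */
        if (test_bit(__I40E_VSI_DOWN, vsi->state))
                return;
        set_bit(__I40E_VSI_NEEDS_RESTART, vsi->state);
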
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: separate PF and VSI state flags · d19cb64b
      By Jacob Keller
      Avoid using the same named flags for both vsi->state and pf->state. This
      makes code review easier, as it is more likely that future authors will
      use the correct state field when checking bits. Previous commits already
      found issues with at least one check, and possibly others may be
      incorrect.
      
      This reduces confusion as it is more clear what each flag represents,
      and which flags are valid for which state field.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  15. 20 April 2017, 2 commits
    • i40e/i40evf: Add tracepoints · ed0980c4
      By Scott Peterson
      This patch adds tracepoints to the i40e and i40evf drivers to which
      BPF programs can be attached for feature testing and verification.
      It's expected that an attached BPF program will identify and count or
      log some interesting subset of traffic. The bcc-tools package is
      helpful there for containing all the BPF arcana in a handy Python
      wrapper. Though you can make these tracepoints log trace messages, the
      messages themselves probably won't be very useful (other than to verify the
      tracepoint is being called while you're debugging your BPF program).
      
      The idea here is that tracepoints have such low performance cost when
      disabled that we can leave these in the upstream drivers. This may
      eventually enable the instrumentation of unmodified customer systems
      should the need arise to verify a NIC feature is working as expected.
      In general this enables one set of feature verification tools to be
      used on these drivers whether they're built with the kernel or
      separately.
      
      Users are advised against using these tracepoints for anything other
      than a diagnostic tool. They have a performance impact when enabled,
      and their exact placement and form may change as we see how well they
      work in practice for the purposes above.
      
      Change-ID: Id6014a7322c0e6d08068114dd20bd156f2f6435e
      Signed-off-by: Scott Peterson <scott.d.peterson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Fix support for flow director programming status · 0e626ff7
      By Alexander Duyck
      This patch fixes an issue I introduced when I converted the code over to
      using the length field to determine if a descriptor was done or not. It
      turns out that we are also processing programming descriptors in the Rx
      path and need to have these processed even though the length field will be
      0 on these packets.  What will happen with a programming descriptor is that
      we will receive a descriptor that has the SPH bit set, and the header
      length and packet length fields cleared.
      
      To account for this we should be checking for the bit for split header
      being set even though we aren't actually using header split. This bit is
      set in the length field to indicate if a programming descriptor response is
      contained in the descriptor. Since we don't support header split we don't
      need to perform the extra checks of using a fixed value for the entire
      length field.
      
      In addition, I am moving the function for checking if a filter is a
      programming status filter into the i40e_txrx.c file; since FCoE support
      has been removed, it no longer makes sense to keep this function in
      i40e.h.
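
      Roughly, the helper boils down to a check of the SPH bit in the
      length field (a paraphrase, not necessarily the exact upstream code):

        /* The Rx filter programming status and the SPH bit occupy the
         * same spot in the descriptor qword.  Since packet split is not
         * supported, the bit can be reused as "this is a programming
         * status descriptor".
         */
        static inline bool i40e_rx_is_programming_status(u64 qword)
        {
                return qword & I40E_RXD_QW1_LENGTH_SPH_MASK;
        }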
      
      Change-ID: I12c359c3dc70adb9d6b92b27324bb2c7f04c1a06
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  16. 08 April 2017, 6 commits
  17. 29 March 2017, 4 commits
  18. 28 March 2017, 2 commits