1. 30 Aug 2018 (3 commits)
  2. 08 Aug 2018 (1 commit)
  3. 28 Jun 2018 (1 commit)
  4. 20 Jun 2018 (1 commit)
  5. 05 Jun 2018 (2 commits)
  6. 03 Jun 2018 (2 commits)
  7. 25 May 2018 (1 commit)
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Committed by Jesper Dangaard Brouer
      This patch changes the ndo_xdp_xmit API to support bulking of
      xdp_frames.

      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge
      slowdown.  Most of the slowdown is caused by the DMA API's indirect
      function calls, but the net_device->ndo_xdp_xmit() call contributes
      as well.

      Benchmarking this patch with CONFIG_RETPOLINE, using xdp_redirect_map
      in a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed the
      following improvements:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

      With frames available as a bulk inside the driver's ndo_xdp_xmit call,
      further optimizations become possible, such as bulk DMA-mapping for TX.

      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.

      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid per-frame producer locking and instead amortize the locking
      cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
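      As a rough illustration of the bulk idea above (the exact kernel
      prototype and its return/ownership semantics are not reproduced here,
      and example_bulk_xdp_xmit is a hypothetical name), a driver-side xmit
      hook might take an array of frames and pay the per-call costs once:

          /* Sketch only: assumes (dev, n frames, frame array) in, number
           * of frames queued out.
           */
          struct net_device;
          struct xdp_frame;

          int example_bulk_xdp_xmit(struct net_device *dev, int n,
                                    struct xdp_frame **frames)
          {
                  int i, sent = 0;

                  /* Lock the TX ring once for the whole bulk, not per frame. */
                  for (i = 0; i < n; i++) {
                          /* Enqueue frames[i] on the HW TX ring; stop early
                           * if the ring fills up.
                           */
                          sent++;
                  }
                  /* Kick/unlock the ring once for the whole bulk. */
                  return sent;    /* frames queued; caller handles the rest */
          }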
  8. 01 May 2018 (1 commit)
  9. 28 Apr 2018 (1 commit)
  10. 17 Apr 2018 (3 commits)
    • xdp: transition into using xdp_frame for ndo_xdp_xmit · 44fa2dbd
      Committed by Jesper Dangaard Brouer
      Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a
      struct xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.
      
      This builds towards changing the API further to become a bulk API,
      because xdp_buff is not a queue-able object while xdp_frame is.
      
      V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
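      The key property referenced above is that an xdp_frame can be queued
      while an xdp_buff cannot.  A simplified sketch of why (the struct
      layouts and the to_frame() helper here are illustrative assumptions;
      the kernel has its own convert_to_xdp_frame() with more bookkeeping):

          struct xdp_buff_sketch  { void *data; void *data_hard_start; unsigned int len; };
          struct xdp_frame_sketch { void *data; unsigned int len; };

          static struct xdp_frame_sketch *to_frame(struct xdp_buff_sketch *xdp)
          {
                  /* Reuse the headroom in front of the packet data as the
                   * frame's metadata, so the frame lives in the packet page
                   * itself and can be queued without any extra allocation.
                   */
                  struct xdp_frame_sketch *frame = xdp->data_hard_start;

                  frame->data = xdp->data;
                  frame->len  = xdp->len;
                  return frame;
          }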
    • xdp: transition into using xdp_frame for return API · 03993094
      Committed by Jesper Dangaard Brouer
      Changing the xdp_return_frame() API to take a struct xdp_frame as its
      argument seems like a natural choice, but there are some subtle
      performance details here that need extra care.

      When the xdp_frame is dereferenced on a remote CPU during DMA-TX
      completion, its cache line transitions to the "Shared" state.  Later,
      when the page is reused for RX, this xdp_frame cache line is written,
      which changes the state to "Modified".

      This situation already happens (naturally) for virtio_net, tun and
      cpumap, as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache lines
      (with pointers) between CPUs, so dereferencing the xdp_frame is the
      only option.

      Only the ixgbe driver had an optimization that avoided dereferencing
      the xdp_frame.  The driver already has a TX-ring queue, which (in case
      of remote DMA-TX completion) has to be transferred between CPUs
      anyhow.  In this data area we stored a struct xdp_mem_info and a data
      pointer, which allowed us to avoid dereferencing the xdp_frame.

      To compensate for this, a prefetchw is used to tell the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate in the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
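      A minimal sketch of the prefetchw pattern described above, in kernel
      style (the surrounding TX-completion handling is elided and the
      function name is illustrative):

          #include <linux/prefetch.h>
          #include <net/xdp.h>

          static void example_tx_complete_one(struct xdp_frame *xdpf)
          {
                  /* Hint the cache coherency protocol that this line will be
                   * written soon (when the page is recycled for RX), so it is
                   * fetched in a write-friendly state instead of "Shared".
                   */
                  prefetchw(xdpf);

                  /* ... DMA unmap and other completion work ... */

                  xdp_return_frame(xdpf);
          }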
    • i40e: convert to use generic xdp_frame and xdp_return_frame API · b411ef11
      Committed by Jesper Dangaard Brouer
      Also convert the i40e driver, which very recently gained XDP_REDIRECT
      support in commit d9314c47 ("i40e: add support for XDP_REDIRECT").
      
      V7: This patch got added in V7 of this patchset.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 27 Mar 2018 (3 commits)
  12. 24 Mar 2018 (1 commit)
  13. 27 Feb 2018 (1 commit)
    • i40e/i40evf: use SW variables for hang detection · 04d41051
      Committed by Alan Brady
      The i40e_detect_recover_hung function uses i40e_get_tx_pending to
      determine whether there are packets stalled on the ring.
      i40e_get_tx_pending calculates the pending packets using the head
      writeback value and the HW tail.  If the queue is stopped and we lose
      the interrupt that would update our next_to_clean, then a) we won't
      get another interrupt to clean because the queue is stopped, and b) we
      won't catch the problem with i40e_detect_recover_hung because the HW
      values make it look like there are no packets waiting to be
      transmitted.  Using the SW values we can catch the issue, because
      next_to_clean will be out of sync with the head writeback.

      This has the added benefit of being less CPU intensive, because we
      don't need to reach into the hardware to get the values.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
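      A hedged sketch of the software-index idea above (the function and
      field names are illustrative, not the driver's actual code): the
      pending count is derived purely from ring indices the driver already
      tracks, so no register read is needed.

          static unsigned int example_sw_tx_pending(unsigned int next_to_use,
                                                    unsigned int next_to_clean,
                                                    unsigned int ring_count)
          {
                  /* Descriptors handed to HW but not yet cleaned by SW. */
                  if (next_to_use >= next_to_clean)
                          return next_to_use - next_to_clean;
                  return ring_count + next_to_use - next_to_clean;
          }

      If this stays non-zero across watchdog cycles while next_to_clean has
      not advanced, the queue is treated as hung even when the HW head/tail
      values look clean.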
  14. 13 Feb 2018 (7 commits)
  15. 27 Jan 2018 (1 commit)
  16. 24 Jan 2018 (1 commit)
    • i40e/i40evf: Detect and recover hung queue scenario · 07d44190
      Committed by Sudheer Mogilappagari
      In VFs there is a known issue that can cause writebacks not to occur
      when interrupts are disabled and there are fewer than 4 descriptors,
      resulting in a TX timeout.  A timeout can also occur due to a lost
      interrupt.

      The current implementation for detecting and recovering from hung
      queues in the PF is problematic because it actively encourages lost
      interrupts.  By triggering a SW interrupt, interrupts are forced on.
      If we are already in napi_poll and an interrupt fires, napi_poll will
      not be rescheduled and the interrupt is effectively lost, thereby
      potentially *causing* hung queues.

      This patch checks whether packets have been processed between watchdog
      cycles to determine potentially hung queues, and triggers a SW
      interrupt only for that particular queue.
      Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
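      A minimal sketch of the per-queue watchdog check described above (all
      names here are illustrative assumptions): remember how many packets
      each TX queue had completed at the previous watchdog tick, and only
      nudge a queue that has pending work but made no progress.

          struct example_txq_watch {
                  unsigned long long last_completed; /* previous cycle snapshot */
          };

          static int example_queue_looks_hung(struct example_txq_watch *w,
                                              unsigned long long completed_now,
                                              unsigned int pending_now)
          {
                  int hung = (completed_now == w->last_completed) && pending_now;

                  w->last_completed = completed_now;
                  return hung; /* caller fires a SW interrupt for this queue only */
          }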
  17. 06 Jan 2018 (1 commit)
    • i40e: setup xdp_rxq_info · 87128824
      Committed by Jesper Dangaard Brouer
      The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
      a sideband channel for configuring/updating the flow director tables.
      Rings of this (i40e_vsi_)type never invoke the XDP/eBPF code.

      As suggested by Björn (V2): instead of treating this I40E_VSI_FDIR
      RX-ring as a special case, reverse the logic and only register
      xdp_rxq_info for RX-rings of type I40E_VSI_MAIN.
      
      Driver hook points for xdp_rxq_info:
       * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
       * unreg: i40e_free_rx_resources    (via i40e_vsi_free_rx_resources)
      
      Tested on actual hardware with samples/bpf program.
      
      V2: Fixed bug in i40e_set_ringparam (memset zero) + match on I40E_VSI_MAIN.
      V4: Update patch desc that got out-of-sync with code.
      
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
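      A sketch of the two hook points listed above (the ring structure is a
      stand-in for the driver's own type, and the xdp_rxq_info_reg()
      signature has gained extra arguments in newer kernels, so treat this
      as illustrative):

          #include <net/xdp.h>

          struct example_rx_ring {          /* stands in for i40e_ring */
                  struct xdp_rxq_info xdp_rxq;
                  struct net_device *netdev;
                  unsigned int queue_index;
                  int vsi_type;
          };

          static int example_rx_ring_setup(struct example_rx_ring *ring)
          {
                  if (ring->vsi_type != I40E_VSI_MAIN)
                          return 0;   /* FDIR sideband rings never run XDP */

                  return xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
                                          ring->queue_index);
          }

          static void example_rx_ring_free(struct example_rx_ring *ring)
          {
                  if (xdp_rxq_info_is_reg(&ring->xdp_rxq))
                          xdp_rxq_info_unreg(&ring->xdp_rxq);
          }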
  18. 04 Jan 2018 (1 commit)
    • i40e/i40evf: Account for frags split over multiple descriptors in check linearize · 248de22e
      Committed by Alexander Duyck
      The original code for __i40e_chk_linearize didn't take into account
      the fact that if a fragment is 16K or larger it has to be split over
      2 descriptors, and the smaller of those 2 descriptors will be on the
      trailing edge of the transmit.  As a result we could get into
      situations where we didn't catch requests that could result in a Tx
      hang.

      This patch takes care of that by subtracting the length of all but
      the trailing edge of the stale fragment before we test the sum.  By
      doing this we can guarantee that we have all cases covered, including
      the case of a fragment that spans multiple descriptors.  We don't
      need to worry about checking the inner portions, since 12K is the
      maximum aligned DMA size and that is larger than any MSS will ever
      be, as the MTU limit for jumbo frames is on the order of 9K.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
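      A worked example of the split described above, taking the commit's
      stated 12K maximum aligned DMA size per descriptor as the only
      assumption (the helper name is illustrative):

          #define EXAMPLE_MAX_DATA_PER_TXD_ALIGNED  (12 * 1024)

          /* Size of the piece of a fragment that lands on the trailing
           * descriptor, after whole 12K descriptors have been filled.
           */
          static unsigned int example_trailing_piece(unsigned int frag_len)
          {
                  unsigned int rem = frag_len % EXAMPLE_MAX_DATA_PER_TXD_ALIGNED;

                  return rem ? rem : EXAMPLE_MAX_DATA_PER_TXD_ALIGNED;
          }

      For a 16K fragment this returns 4K: only that trailing 4K belongs to
      the descriptor window being summed, so the leading 12K has to be
      subtracted before the test, which is what the patch does.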
  19. 22 Nov 2017 (1 commit)
  20. 01 Nov 2017 (1 commit)
  21. 26 Oct 2017 (2 commits)
  22. 10 Oct 2017 (3 commits)
    • i40e: Fix memory leak related filter programming status · 2b9478ff
      Committed by Alexander Duyck
      It looks like we weren't correctly placing the pages from buffers
      that had been used to return a filter programming status back on the
      ring.  As a result they were being overwritten and tracking of the
      pages was lost.

      This change corrects that by incorporating part of i40e_put_rx_buffer
      into the programming status handler code.  As a result we should now
      correctly place the pages for those buffers on the re-allocation list
      instead of letting them stay in place.
      
      Fixes: 0e626ff7 ("i40e: Fix support for flow director programming status")
      Reported-by: Anders K. Pedersen <akp@cohaesio.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Anders K Pedersen <akp@cohaesio.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: bump tail only in multiples of 8 · 11f29003
      Committed by Jacob Keller
      Hardware only fetches descriptors in cachelines of 8, essentially
      ignoring the lower 3 bits of the tail register.  It is therefore
      pointless to bump tail by an unaligned amount, as the hardware will
      ignore some of the new descriptors we allocated.  Ideally, tail
      writes should always be aligned to 8.
      
      At first, it seems like we'd already do this, since we allocate
      descriptors in batches which are a multiple of 8. Since we'd always
      increment by a multiple of 8, it seems like the value should always be
      aligned.
      
      However, this ignores allocation failures. If we fail to allocate
      a buffer, our tail register will become unaligned. Once it has become
      unaligned it will essentially be stuck unaligned until a buffer
      allocation happens to fail at the exact amount necessary to re-align it.
      
      We can do better by simply rounding down the number of buffers we're
      about to allocate (cleaned_count) so that "next_to_clean +
      cleaned_count" is aligned to a multiple of 8.

      We do this by calculating how far off that value is and subtracting
      the difference from cleaned_count.  This essentially defers
      allocation of buffers if they're going to be ignored by hardware
      anyway, and re-aligns our next_to_use and tail values after a failed
      buffer allocation.
      
      This calculation ensures that we always align the tail writes in a way
      the hardware expects and don't unnecessarily allocate buffers which
      won't be fetched immediately.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
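      A sketch of the rounding described above (names are illustrative, not
      the driver's actual code): trim the refill batch so that the value we
      end up writing to tail stays a multiple of 8, and defer the rest.

          static unsigned short example_align_refill(unsigned short next_to_clean,
                                                     unsigned short cleaned_count)
          {
                  /* How far past a multiple of 8 the new value would land. */
                  unsigned short overshoot = (next_to_clean + cleaned_count) & 0x7;

                  /* Defer those buffers; HW ignores sub-8 tail bumps anyway. */
                  return (overshoot <= cleaned_count) ? cleaned_count - overshoot : 0;
          }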
    • i40e/i40evf: always set the CLEARPBA flag when re-enabling interrupts · dbadbbe2
      Committed by Jacob Keller
      In the past we changed driver behavior to not clear the PBA when
      re-enabling interrupts. This change was motivated by the flawed belief
      that clearing the PBA would cause a lost interrupt if a receive
      interrupt occurred while interrupts were disabled.
      
      According to empirical testing this isn't the case. Additionally, the
      data sheet specifically says that we should set the CLEARPBA bit when
      re-enabling interrupts in a polling setup.
      
      This reverts commit 40d72a50 ("i40e/i40evf: don't lose interrupts")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  23. 06 Oct 2017 (1 commit)
    • i40e: ignore skb->xmit_more when deciding to set RS bit · a5340d93
      Committed by Jacob Keller
      Since commit 6a7fded7 ("i40e: Fix RS bit update in Tx path and
      disable force WB workaround") we've tried to "optimize" setting the
      RS bit based around skb->xmit_more. This same logic was refactored
      in commit 1dc8b538 ("i40e: Reorder logic for coalescing RS bits"),
      but ultimately was not functionally changed.
      
      Using skb->xmit_more in this way is incorrect, because in certain
      circumstances we may see a large number of skbs in sequence with
      xmit_more set.  This leads to a performance loss, as the hardware
      does not write back anything for those packets, which delays how
      long it takes for us to respond to the stack's transmit requests.
      This significantly impacts UDP performance, especially when layered
      with multiple devices such as bonding, VLANs, and vnet setups.
      
      This was not noticed until now because it is difficult to create a setup
      which reproduces the issue. It was discovered in a UDP_STREAM test in
      a VM, connected using a vnet device to a bridge, which is connected to
      a bonded pair of X710 ports in active-backup mode with a VLAN. These
      layered devices seem to compound the number of skbs transmitted at once
      by the qdisc. Additionally, the problem can be masked by reducing the
      ITR value.
      
      Since the original commit does not provide strong justification for this
      RS bit "optimization", revert to the previous behavior of setting the RS
      bit every 4th packet.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
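      A sketch of the reverted-to behavior (the stride constant and helper
      name here are illustrative): request a completion writeback, via the
      RS bit, roughly every 4th packet instead of keying off skb->xmit_more.

          #define EXAMPLE_WB_STRIDE  4

          static int example_want_rs_bit(unsigned int *packet_stride)
          {
                  if (++(*packet_stride) >= EXAMPLE_WB_STRIDE) {
                          *packet_stride = 0;
                          return 1;  /* set RS: HW reports completion here */
                  }
                  return 0;          /* coalesce: no writeback for this packet */
          }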