1. 17 4月, 2018 38 次提交
    • J
      xdp: transition into using xdp_frame for ndo_xdp_xmit · 44fa2dbd
      Jesper Dangaard Brouer 提交于
      Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
      xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.
      
      This builds towards changing the API further to become a bulk API,
      because xdp_buff is not a queue-able object while xdp_frame is.
      
      V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44fa2dbd
    • J
      xdp: transition into using xdp_frame for return API · 03993094
      Jesper Dangaard Brouer 提交于
      Changing API xdp_return_frame() to take struct xdp_frame as argument,
      seems like a natural choice. But there are some subtle performance
      details here that needs extra care, which is a deliberate choice.
      
      When de-referencing xdp_frame on a remote CPU during DMA-TX
      completion, result in the cache-line is change to "Shared"
      state. Later when the page is reused for RX, then this xdp_frame
      cache-line is written, which change the state to "Modified".
      
      This situation already happens (naturally) for, virtio_net, tun and
      cpumap as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache-lines
      (with pointers) between CPUs. Thus, the only option is to
      de-referencing xdp_frame.
      
      It is only the ixgbe driver that had an optimization, in which it can
      avoid doing the de-reference of xdp_frame.  The driver already have
      TX-ring queue, which (in case of remote DMA-TX completion) have to be
      transferred between CPUs anyhow.  In this data area, we stored a
      struct xdp_mem_info and a data pointer, which allowed us to avoid
      de-referencing xdp_frame.
      
      To compensate for this, a prefetchw is used for telling the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03993094
    • J
      mlx5: use page_pool for xdp_return_frame call · 60bbf7ee
      Jesper Dangaard Brouer 提交于
      This patch shows how it is possible to have both the driver local page
      cache, which uses elevated refcnt for "catching"/avoiding SKB
      put_page returns the page through the page allocator.  And at the
      same time, have pages getting returned to the page_pool from
      ndp_xdp_xmit DMA completion.
      
      The performance improvement for XDP_REDIRECT in this patch is really
      good.  Especially considering that (currently) the xdp_return_frame
      API and page_pool_put_page() does per frame operations of both
      rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
      (It is the plan to remove these per frame operation in a followup
      patchset).
      
      The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
      with xdp_redirect_map (using devmap) . And the target/maximum
      capability of ixgbe is 13Mpps (on this HW setup).
      
      Before this patch for mlx5, XDP redirected frames were returned via
      the page allocator.  The single flow performance was 6Mpps, and if I
      started two flows the collective performance drop to 4Mpps, because we
      hit the page allocator lock (further negative scaling occurs).
      
      Two test scenarios need to be covered, for xdp_return_frame API, which
      is DMA-TX completion running on same-CPU or cross-CPU free/return.
      Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
      close to our 13Mpps max target.
      
      The reason max target isn't reached in cross-CPU test, is likely due
      to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
      ixgbe testing).  It is also planned to remove this unnecessary DMA
      unmap in a later patchset
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes not return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
       - Save a branch in mlx5e_page_release
       - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V5: Updated patch desc
      
      V8: Adjust for b0cedc84 ("net/mlx5e: Remove rq_headroom field from params")
      V9:
       - Adjust for 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
       - Adjust for 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
       - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V10: Req from Tariq
       - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60bbf7ee
    • J
      xdp: allow page_pool as an allocator type in xdp_return_frame · 57d0a1c1
      Jesper Dangaard Brouer 提交于
      New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
      
      The registered allocator page_pool pointer is not available directly
      from xdp_rxq_info, but it could be (if needed).  For now, the driver
      should keep separate track of the page_pool pointer, which it should
      use for RX-ring page allocation.
      
      As suggested by Saeed, to maintain a symmetric API it is the drivers
      responsibility to allocate/create and free/destroy the page_pool.
      Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
      responsibility to free the page_pool, but with a RCU free call.  This
      is done easily via the page_pool helper page_pool_destroy() (which
      avoids touching any driver code during the RCU callback, which could
      happen after the driver have been unloaded).
      
      V8: address issues found by kbuild test robot
       - Address sparse should be static warnings
       - Allow xdp.o to be compiled without page_pool.o
      
      V9: Remove inline from .c file, compiler knows best
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57d0a1c1
    • J
      page_pool: refurbish version of page_pool code · ff7d6b27
      Jesper Dangaard Brouer 提交于
      Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
      pages on DMA-TX completion time, which have good cross CPU
      performance, given DMA-TX completion time can happen on a remote CPU.
      
      Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
      Adapted page_pool code to not depend the page allocator and
      integration into struct page.  The DMA mapping feature is kept,
      even-though it will not be activated/used in this patchset.
      
      [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes, don't return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
      
      V4: many small improvements and cleanups
      - Add DOC comment section, that can be used by kernel-doc
      - Improve fallback mode, to work better with refcnt based recycling
        e.g. remove a WARN as pointed out by Tariq
        e.g. quicker fallback if ptr_ring is empty.
      
      V5: Fixed SPDX license as pointed out by Alexei
      
      V6: Adjustments requested by Eric Dumazet
       - Adjust ____cacheline_aligned_in_smp usage/placement
       - Move rcu_head in struct page_pool
       - Free pages quicker on destroy, minimize resources delayed an RCU period
       - Remove code for forward/backward compat ABI interface
      
      V8: Issues found by kbuild test robot
       - Address sparse should be static warnings
       - Only compile+link when a driver use/select page_pool,
         mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff7d6b27
    • J
      xdp: rhashtable with allocator ID to pointer mapping · 8d5d8852
      Jesper Dangaard Brouer 提交于
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
      uses a radix tree, use a dynamic rhashtable, for creating ID to
      pointer lookup table, because this is faster.
      
      The problem that is being solved here is that, the xdp_rxq_info
      pointer (stored in xdp_buff) cannot be used directly, as the
      guaranteed lifetime is too short.  The info is needed on a
      (potentially) remote CPU during DMA-TX completion time . In an
      xdp_frame the xdp_mem_info is stored, when it got converted from an
      xdp_buff, which is sufficient for the simple page refcnt based recycle
      schemes.
      
      For more advanced allocators there is a need to store a pointer to the
      registered allocator.  Thus, there is a need to guard the lifetime or
      validity of the allocator pointer, which is done through this
      rhashtable ID map to pointer. The removal and validity of of the
      allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
      allocator will be created by the driver, and registered with
      xdp_rxq_info_reg_mem_model().
      
      It is up-to debate who is responsible for freeing the allocator
      pointer or invoking the allocator destructor function.  In any case,
      this must happen via RCU freeing.
      
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.
      
      V4: Per req of Jason Wang
      - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
        XDP_REDIRECT, even-though it's not strictly necessary when
        allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
      
      V6: Per req of Alex Duyck
      - Introduce rhashtable_lookup() call in later patch
      
      V8: Address sparse should be static warnings (from kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d5d8852
    • J
      mlx5: register a memory model when XDP is enabled · 84f5e3fb
      Jesper Dangaard Brouer 提交于
      Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
      This enable a different memory model, thus activating another code path
      in the xdp_return_frame API.
      
      V2: Fixed issues pointed out by Tariq.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84f5e3fb
    • J
      i40e: convert to use generic xdp_frame and xdp_return_frame API · b411ef11
      Jesper Dangaard Brouer 提交于
      Also convert driver i40e, which very recently got XDP_REDIRECT support
      in commit d9314c47 ("i40e: add support for XDP_REDIRECT").
      
      V7: This patch got added in V7 of this patchset.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b411ef11
    • J
      bpf: cpumap convert to use generic xdp_frame · 70280ed9
      Jesper Dangaard Brouer 提交于
      The generic xdp_frame format, was inspired by the cpumap own internal
      xdp_pkt format.  It is now time to convert it over to the generic
      xdp_frame format.  The cpumap needs one extra field dev_rx.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70280ed9
    • J
      virtio_net: convert to use generic xdp_frame and xdp_return_frame API · cac320c8
      Jesper Dangaard Brouer 提交于
      The virtio_net driver assumes XDP frames are always released based on
      page refcnt (via put_page).  Thus, is only queues the XDP data pointer
      address and uses virt_to_head_page() to retrieve struct page.
      
      Use the XDP return API to get away from such assumptions. Instead
      queue an xdp_frame, which allow us to use the xdp_return_frame API,
      when releasing the frame.
      
      V8: Avoid endianness issues (found by kbuild test robot)
      V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cac320c8
    • J
      tun: convert to use generic xdp_frame and xdp_return_frame API · 1ffcbc85
      Jesper Dangaard Brouer 提交于
      The tuntap driver invented it's own driver specific way of queuing
      XDP packets, by storing the xdp_buff information in the top of
      the XDP frame data.
      
      Convert it over to use the more generic xdp_frame structure.  The
      main problem with the in-driver method is that the xdp_rxq_info pointer
      cannot be trused/used when dequeueing the frame.
      
      V3: Remove check based on feedback from Jason
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ffcbc85
    • J
      xdp: introduce a new xdp_frame type · c0048cff
      Jesper Dangaard Brouer 提交于
      This is needed to convert drivers tuntap and virtio_net.
      
      This is a generalization of what is done inside cpumap, which will be
      converted later.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0048cff
    • J
      xdp: move struct xdp_buff from filter.h to xdp.h · 106ca27f
      Jesper Dangaard Brouer 提交于
      This is done to prepare for the next patch, and it is also
      nice to move this XDP related struct out of filter.h.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      106ca27f
    • J
      ixgbe: use xdp_return_frame API · 189ead81
      Jesper Dangaard Brouer 提交于
      Extend struct ixgbe_tx_buffer to store the xdp_mem_info.
      
      Notice that this could be optimized further by putting this into
      a union in the struct ixgbe_tx_buffer, but this patchset
      works towards removing this again.  Thus, this is not done.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      189ead81
    • J
      xdp: introduce xdp_return_frame API and use in cpumap · 5ab073ff
      Jesper Dangaard Brouer 提交于
      Introduce an xdp_return_frame API, and convert over cpumap as
      the first user, given it have queued XDP frame structure to leverage.
      
      V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
      V6: Remove comment that id will be added later (Req by Alex Duyck)
      V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ab073ff
    • J
      mlx5: basic XDP_REDIRECT forward support · 5168d732
      Jesper Dangaard Brouer 提交于
      This implements basic XDP redirect support in mlx5 driver.
      
      Notice that the ndo_xdp_xmit() is NOT implemented, because that API
      need some changes that this patchset is working towards.
      
      The main purpose of this patch is have different drivers doing
      XDP_REDIRECT to show how different memory models behave in a cross
      driver world.
      
      Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect,
      as the return API does not exist yet to to keep this mapped.
      
      Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
      introduce xdpsq.db.redirect_flush boolian.
      
      V9: Adjust for commit 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5168d732
    • I
      liquidio: Enhanced ethtool stats · 897ddc24
      Intiyaz Basha 提交于
      1. Added red_drops stats. Inbound packets dropped by RED, buffer exhaustion
      2. Included fcs_err, jabber_err, l2_err and frame_err errors under
         rx_errors
      3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link and
         fw_err_drop under rx_dropped
      4. Included max_collision_fail, max_deferral_fail, total_collisions,
         fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under tx_dropped
      5. Counting dma mapping errors
      6. Added some firmware stats description and removed for some
      Signed-off-by: NIntiyaz Basha <intiyaz.basha@cavium.com>
      Acked-by: NDerek Chickles <derek.chickles@cavium.com>
      Acked-by: NSatanand Burla <satananda.burla@cavium.com>
      Signed-off-by: NFelix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      897ddc24
    • A
      net: Remove unused tcp_set_state tracepoint · ef53e9e1
      Andrey Ignatov 提交于
      This tracepoint was replaced by inet_sock_set_state in 563e0bb0 and not
      used anywhere in the kernel anymore. Remove it.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef53e9e1
    • D
      Merge branch 'pci-mrrs-consts' · 4c85d2d4
      David S. Miller 提交于
      Heiner Kallweit says:
      
      ====================
      PCI: add two more values for PCIe Max_Read_Request_Size and initially use them in r8169 network driver
      
      In r8169 network driver I stumbled across a magic number translating
      to PCI MRRS size 4K. The PCI core is still missing constants for
      values 2K and 4K (as defined in PCI standard).
      
      So let's add these two constants and use the 4K constant in r8169.
      
      Second patch depends on the first one, therefore both patches
      preferrably should go through either PCI or netdev tree.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c85d2d4
    • H
      r8169: replace magic numbers with PCI MRRS constant · 8d98aa39
      Heiner Kallweit 提交于
      Replace magic number "0x5 << MAX_READ_REQUEST_SHIFT" with the
      appropriate constant as defined in PCI core.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d98aa39
    • H
      PCI: Add two more values for PCIe Max_Read_Request_Size · a5724fc3
      Heiner Kallweit 提交于
      This patch adds missing values for the max read request size.
      E.g. network driver r8169 uses a value of 4K.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Acked-by: NBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5724fc3
    • D
      Merge branch 'net-stmmac-Stop-using-hard-coded-callbacks' · 5da8baa3
      David S. Miller 提交于
      Jose Abreu says:
      
      ====================
      net: stmmac: Stop using hard-coded callbacks
      
      This a starting point for a cleanup and re-organization of stmmac.
      
      In this series we stop using hard-coded callbacks along the code and use
      instead helpers which are defined in a single place ("hwif.h").
      
      This brings several advantages:
      	1) Less typing :)
      	2) Guaranteed function pointer check
      	3) More flexibility
      
      By 2) we stop using the repeated pattern of:
      	if (priv->hw->mac->some_func)
      		priv->hw->mac->some_func(...)
      
      I didn't check but I expect the final .ko will be bigger with this series
      because *all* of function pointers are checked.
      
      Anyway, I hope this can make the code more readable and more flexible now.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5da8baa3
    • J
      net: stmmac: Switch stmmac_mode_ops to generic HW Interface Helpers · 2c520b1c
      Jose Abreu 提交于
      Switch stmmac_mode_ops to generic Hardware Interface Helpers instead of
      using hard-coded callbacks. This makes the code more readable and more
      flexible.
      
      No functional change.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c520b1c
    • J
      net: stmmac: Switch stmmac_hwtimestamp to generic HW Interface Helpers · cc4c9001
      Jose Abreu 提交于
      Switch stmmac_hwtimestamp to generic Hardware Interface Helpers instead
      of using hard-coded callbacks. This makes the code more readable and
      more flexible.
      
      No functional change.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc4c9001
    • J
      net: stmmac: Switch stmmac_ops to generic HW Interface Helpers · c10d4c82
      Jose Abreu 提交于
      Switch stmmac_ops to generic Hardware Interface Helpers instead of using
      hard-coded callbacks. This makes the code more readable and more
      flexible.
      
      No functional change.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c10d4c82
    • J
      net: stmmac: Switch stmmac_dma_ops to generic HW Interface Helpers · a4e887fa
      Jose Abreu 提交于
      Switch stmmac_dma_ops to generic Hardware Interface Helpers instead of
      using hard-coded callbacks. This makes the code more readable and more
      flexible.
      
      No functional change.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4e887fa
    • J
      net: stmmac: Switch stmmac_desc_ops to generic HW Interface Helpers · 42de047d
      Jose Abreu 提交于
      Switch stmmac_desc_ops to generic Hardware Interface Helpers instead of
      using hard-coded callbacks. This makes the code more readable and more
      flexible.
      
      No functional change.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42de047d
    • D
      Merge branch 'tcp-zero-copy-receive' · 309c446c
      David S. Miller 提交于
      Eric Dumazet says:
      
      ====================
      tcp: add zero copy receive
      
      This patch series add mmap() support to TCP sockets for RX zero copy.
      
      While tcp_mmap() patch itself is quite small (~100 LOC), optimal support
      for asynchronous mmap() required better SO_RCVLOWAT behavior, and a
      test program to demonstrate how mmap() on TCP sockets can be used.
      
      Note that mmap() (and associated munmap()) calls are adding more
      pressure on per-process VM semaphore, so might not show benefit
      for processus with high number of threads.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      309c446c
    • E
      selftests: net: add tcp_mmap program · 192dc405
      Eric Dumazet 提交于
      This is a reference program showing how mmap() can be used
      on TCP flows to implement receive zero copy.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      192dc405
    • E
      tcp: implement mmap() for zero copy receive · 93ab6cc6
      Eric Dumazet 提交于
      Some networks can make sure TCP payload can exactly fit 4KB pages,
      with well chosen MSS/MTU and architectures.
      
      Implement mmap() system call so that applications can avoid
      copying data without complex splice() games.
      
      Note that a successful mmap( X bytes) on TCP socket is consuming
      bytes, as if recvmsg() has been done. (tp->copied += X)
      
      Only PROT_READ mappings are accepted, as skb page frags
      are fundamentally shared and read only.
      
      If tcp_mmap() finds data that is not a full page, or a patch of
      urgent data, -EINVAL is returned, no bytes are consumed.
      
      Application must fallback to recvmsg() to read the problematic sequence.
      
      mmap() wont block,  regardless of socket being in blocking or
      non-blocking mode. If not enough bytes are in receive queue,
      mmap() would return -EAGAIN, or -EIO if socket is in a state
      where no other bytes can be added into receive queue.
      
      An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
      to efficiently use mmap()
      
      On the sender side, MSG_EOR might help to clearly separate unaligned
      headers and 4K-aligned chunks if necessary.
      
      Tested:
      
      mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
      MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
      
      Without mmap() (tcp_mmap -s)
      
      received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
        cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
      received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
        cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
      received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
        cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
      received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
        cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
      
      With mmap() on receiver (tcp_mmap -s -z)
      
      received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
        cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
        cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
        cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
      received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
        cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93ab6cc6
    • E
      tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      Eric Dumazet 提交于
      SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
      generated when enough bytes are available in receive queue, after
      David change (commit c7004482 "tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      But TCP still calls sk->sk_data_ready() for each chunk added in receive
      queue, meaning thread is awaken, and goes back to sleep shortly after.
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03f45c88
    • E
      tcp: fix delayed acks behavior for SO_RCVLOWAT · 796f82ea
      Eric Dumazet 提交于
      We should not delay acks if there are not enough bytes
      in receive queue to satisfy SO_RCVLOWAT.
      
      Since [E]POLLIN event is not going to be generated, there is little
      hope for a delayed ack to be useful.
      
      In fact, delaying ACK prevents sender from completing
      the transfer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      796f82ea
    • E
      tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet 提交于
      Applications might use SO_RCVLOWAT on TCP socket hoping to receive
      one [E]POLLIN event only when a given amount of bytes are ready in socket
      receive queue.
      
      Problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1361840
    • R
      tc-testing: add sample action tests · 10b19aea
      Roman Mashak 提交于
      Signed-off-by: NRoman Mashak <mrv@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10b19aea
    • L
      ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr() · f85f94b8
      Lorenzo Bianconi 提交于
      Remove unnecessary check on update_lft variable in
      addrconf_prefix_rcv_add_addr routine since it is always set to 0.
      Moreover remove update_lft re-initialization to 0
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f85f94b8
    • M
      net: socionext: reset hardware in ndo_stop · 9a00b697
      Masahisa KOJIMA 提交于
      When the interface is down, head/tail of the descriptor
      ring address is set to 0 in netsec_netdev_stop().
      But netsec hardware still keeps the previous descriptor
      ring address, so there is inconsistency between driver
      and hardware after interface is up at a later time.
      To address this inconsistency, add netsec_reset_hardware()
      when the interface is down.
      
      In addition, to minimize the reset process,
      add flag to decide whether driver loads the netsec microcode.
      Even if driver resets the netsec hardware, netsec microcode
      keeps resident on RAM, so it is ok we only load the microcode
      at initialization.
      
      This patch is critical for installation over network.
      Signed-off-by: NMasahisa KOJIMA <masahisa.kojima@linaro.org>
      Fixes: 533dd11a ("net: socionext: Add Synquacer NetSec driver")
      Signed-off-by: NJassi Brar <jaswinder.singh@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a00b697
    • J
      net: netsec: enable tx-irq during open callback · c009f413
      Jassi Brar 提交于
      Enable TX-irq as well during ndo_open() as we can not count upon
      RX to arrive early enough to trigger the napi. This patch is critical
      for installation over network.
      
      Fixes: 533dd11a ("net: socionext: Add Synquacer NetSec driver")
      Signed-off-by: NJassi Brar <jaswinder.singh@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c009f413
    • R
      net: mediatek: use of_device_get_match_data() · eda7d46d
      Ryder Lee 提交于
      The usage of of_device_get_match_data() reduce the code size a bit.
      
      Also, the only way to call mtk_probe() is to match an entry in
      of_mtk_match[], so match cannot be NULL.
      Signed-off-by: NRyder Lee <ryder.lee@mediatek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eda7d46d
  2. 13 4月, 2018 2 次提交
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 5d136594
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
      
       1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
          GSO/LLTX properly. From Sabrina Dubroca.
      
       2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.
      
       3) Don't uncompress in slip before rstate is initialized, from Tejaswi
          Tanikella.
      
       4) When using 1.x firmware on aquantia, issue a deinit before we
          hardware reset the chip, otherwise we break dirty wake WOL. From
          Igor Russkikh.
      
       5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.
      
       6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.
      
       7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
          from Guillaume Nault.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
        l2tp: fix race in duplicate tunnel detection
        l2tp: fix races in tunnel creation
        tun: send netlink notification when the device is modified
        tun: set the flags before registering the netdevice
        lan78xx: Don't reset the interface on open
        bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
        bnxt_en: Need to include RDMA rings in bnxt_check_rings().
        bnxt_en: Support max-mtu with VF-reps
        bnxt_en: Ignore src port field in decap filter nodes
        bnxt_en: do not allow wildcard matches for L2 flows
        bnxt_en: Fix ethtool -x crash when device is down.
        vhost: return bool from *_access_ok() functions
        vhost: fix vhost_vq_access_ok() log check
        vhost: Fix vhost_copy_to_user()
        net: aquantia: oops when shutdown on already stopped device
        net: aquantia: Regression on reset with 1.x firmware
        cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN
        slip: Check if rstate is initialized before uncompressing
        lan78xx: Avoid spurious kevent 4 "error"
        lan78xx: Correctly indicate invalid OTP
        ...
      5d136594
    • L
      Merge tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 67a7a8ff
      Linus Torvalds 提交于
      Pull xen fixes from Juergen Gross:
       "A few fixes of Xen related core code and drivers"
      
      * tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/pvh: Indicate XENFEAT_linux_rsdp_unrestricted to Xen
        xen/acpi: off by one in read_acpi_id()
        xen/acpi: upload _PSD info for non Dom0 CPUs too
        x86/xen: Delay get_cpu_cap until stack canary is established
        xen: xenbus_dev_frontend: Verify body of XS_TRANSACTION_END
        xen: xenbus: Catch closing of non existent transactions
        xen: xenbus_dev_frontend: Fix XS_TRANSACTION_END handling
      67a7a8ff