提交 · 039930945a72d9af5ff04ae9b9e60658a52e0770 · openanolis / cloud-kernel

17 4月, 2018 37 次提交

xdp: transition into using xdp_frame for return API · 03993094

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03993094

mlx5: use page_pool for xdp_return_frame call · 60bbf7ee

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator.  And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP redirected frames were returned via
the page allocator.  The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing).  It is also planned to remove this unnecessary DMA
unmap in a later patchset

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

V8: Adjust for b0cedc84 ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V10: Req from Tariq
 - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

60bbf7ee

xdp: allow page_pool as an allocator type in xdp_return_frame · 57d0a1c1

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call.  This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).

V8: address issues found by kbuild test robot
 - Address sparse should be static warnings
 - Allow xdp.o to be compiled without page_pool.o

V9: Remove inline from .c file, compiler knows best
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

57d0a1c1

page_pool: refurbish version of page_pool code · ff7d6b27

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.

Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page.  The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

V6: Adjustments requested by Eric Dumazet
 - Adjust ____cacheline_aligned_in_smp usage/placement
 - Move rcu_head in struct page_pool
 - Free pages quicker on destroy, minimize resources delayed an RCU period
 - Remove code for forward/backward compat ABI interface

V8: Issues found by kbuild test robot
 - Address sparse should be static warnings
 - Only compile+link when a driver use/select page_pool,
   mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff7d6b27

xdp: rhashtable with allocator ID to pointer mapping · 8d5d8852

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.

The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even-though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d5d8852

mlx5: register a memory model when XDP is enabled · 84f5e3fb

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enable a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

84f5e3fb

i40e: convert to use generic xdp_frame and xdp_return_frame API · b411ef11

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Also convert driver i40e, which very recently got XDP_REDIRECT support
in commit d9314c47 ("i40e: add support for XDP_REDIRECT").

V7: This patch got added in V7 of this patchset.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b411ef11

bpf: cpumap convert to use generic xdp_frame · 70280ed9

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

The generic xdp_frame format, was inspired by the cpumap own internal
xdp_pkt format.  It is now time to convert it over to the generic
xdp_frame format.  The cpumap needs one extra field dev_rx.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70280ed9

virtio_net: convert to use generic xdp_frame and xdp_return_frame API · cac320c8

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page).  Thus, is only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve struct page.

Use the XDP return API to get away from such assumptions. Instead
queue an xdp_frame, which allow us to use the xdp_return_frame API,
when releasing the frame.

V8: Avoid endianness issues (found by kbuild test robot)
V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter)
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cac320c8

tun: convert to use generic xdp_frame and xdp_return_frame API · 1ffcbc85

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

The tuntap driver invented it's own driver specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trused/used when dequeueing the frame.

V3: Remove check based on feedback from Jason
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ffcbc85

xdp: introduce a new xdp_frame type · c0048cff

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

This is needed to convert drivers tuntap and virtio_net.

This is a generalization of what is done inside cpumap, which will be
converted later.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c0048cff

xdp: move struct xdp_buff from filter.h to xdp.h · 106ca27f

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

This is done to prepare for the next patch, and it is also
nice to move this XDP related struct out of filter.h.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

106ca27f

ixgbe: use xdp_return_frame API · 189ead81

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Extend struct ixgbe_tx_buffer to store the xdp_mem_info.

Notice that this could be optimized further by putting this into
a union in the struct ixgbe_tx_buffer, but this patchset
works towards removing this again.  Thus, this is not done.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

189ead81

xdp: introduce xdp_return_frame API and use in cpumap · 5ab073ff

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.

V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5ab073ff

mlx5: basic XDP_REDIRECT forward support · 5168d732

由 Jesper Dangaard Brouer 提交于 4月 17, 2018

This implements basic XDP redirect support in mlx5 driver.

Notice that the ndo_xdp_xmit() is NOT implemented, because that API
need some changes that this patchset is working towards.

The main purpose of this patch is have different drivers doing
XDP_REDIRECT to show how different memory models behave in a cross
driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect,
as the return API does not exist yet to to keep this mapped.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce xdpsq.db.redirect_flush boolian.

V9: Adjust for commit 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5168d732

liquidio: Enhanced ethtool stats · 897ddc24

由 Intiyaz Basha 提交于 4月 16, 2018

1. Added red_drops stats. Inbound packets dropped by RED, buffer exhaustion
2. Included fcs_err, jabber_err, l2_err and frame_err errors under
   rx_errors
3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link and
   fw_err_drop under rx_dropped
4. Included max_collision_fail, max_deferral_fail, total_collisions,
   fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under tx_dropped
5. Counting dma mapping errors
6. Added some firmware stats description and removed for some
Signed-off-by: NIntiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: NDerek Chickles <derek.chickles@cavium.com>
Acked-by: NSatanand Burla <satananda.burla@cavium.com>
Signed-off-by: NFelix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

897ddc24

net: Remove unused tcp_set_state tracepoint · ef53e9e1

由 Andrey Ignatov 提交于 4月 16, 2018

This tracepoint was replaced by inet_sock_set_state in 563e0bb0 and not
used anywhere in the kernel anymore. Remove it.
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef53e9e1

Merge branch 'pci-mrrs-consts' · 4c85d2d4

由 David S. Miller 提交于 4月 16, 2018

Heiner Kallweit says:

====================
PCI: add two more values for PCIe Max_Read_Request_Size and initially use them in r8169 network driver

In r8169 network driver I stumbled across a magic number translating
to PCI MRRS size 4K. The PCI core is still missing constants for
values 2K and 4K (as defined in PCI standard).

So let's add these two constants and use the 4K constant in r8169.

Second patch depends on the first one, therefore both patches
preferrably should go through either PCI or netdev tree.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c85d2d4

r8169: replace magic numbers with PCI MRRS constant · 8d98aa39

由 Heiner Kallweit 提交于 4月 16, 2018

Replace magic number "0x5 << MAX_READ_REQUEST_SHIFT" with the
appropriate constant as defined in PCI core.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d98aa39

PCI: Add two more values for PCIe Max_Read_Request_Size · a5724fc3

由 Heiner Kallweit 提交于 4月 16, 2018

This patch adds missing values for the max read request size.
E.g. network driver r8169 uses a value of 4K.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Acked-by: NBjorn Helgaas <bhelgaas@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a5724fc3

Merge branch 'net-stmmac-Stop-using-hard-coded-callbacks' · 5da8baa3

由 David S. Miller 提交于 4月 16, 2018

Jose Abreu says:

====================
net: stmmac: Stop using hard-coded callbacks

This a starting point for a cleanup and re-organization of stmmac.

In this series we stop using hard-coded callbacks along the code and use
instead helpers which are defined in a single place ("hwif.h").

This brings several advantages:
	1) Less typing :)
	2) Guaranteed function pointer check
	3) More flexibility

By 2) we stop using the repeated pattern of:
	if (priv->hw->mac->some_func)
		priv->hw->mac->some_func(...)

I didn't check but I expect the final .ko will be bigger with this series
because *all* of function pointers are checked.

Anyway, I hope this can make the code more readable and more flexible now.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5da8baa3

net: stmmac: Switch stmmac_mode_ops to generic HW Interface Helpers · 2c520b1c

由 Jose Abreu 提交于 4月 16, 2018

Switch stmmac_mode_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c520b1c

net: stmmac: Switch stmmac_hwtimestamp to generic HW Interface Helpers · cc4c9001

由 Jose Abreu 提交于 4月 16, 2018

Switch stmmac_hwtimestamp to generic Hardware Interface Helpers instead
of using hard-coded callbacks. This makes the code more readable and
more flexible.

No functional change.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc4c9001

net: stmmac: Switch stmmac_ops to generic HW Interface Helpers · c10d4c82

由 Jose Abreu 提交于 4月 16, 2018

Switch stmmac_ops to generic Hardware Interface Helpers instead of using
hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c10d4c82

net: stmmac: Switch stmmac_dma_ops to generic HW Interface Helpers · a4e887fa

由 Jose Abreu 提交于 4月 16, 2018

Switch stmmac_dma_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4e887fa

net: stmmac: Switch stmmac_desc_ops to generic HW Interface Helpers · 42de047d

由 Jose Abreu 提交于 4月 16, 2018

Switch stmmac_desc_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42de047d

Merge branch 'tcp-zero-copy-receive' · 309c446c

由 David S. Miller 提交于 4月 16, 2018

Eric Dumazet says:

====================
tcp: add zero copy receive

This patch series add mmap() support to TCP sockets for RX zero copy.

While tcp_mmap() patch itself is quite small (~100 LOC), optimal support
for asynchronous mmap() required better SO_RCVLOWAT behavior, and a
test program to demonstrate how mmap() on TCP sockets can be used.

Note that mmap() (and associated munmap()) calls are adding more
pressure on per-process VM semaphore, so might not show benefit
for processus with high number of threads.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

309c446c

selftests: net: add tcp_mmap program · 192dc405