Commits · b1e455260c9187b16dd4ebc428b817ebac322043 · openeuler / raspberrypi-kernel

01 May, 2017 32 commits

mlxsw: spectrum_router: Simplify VRF enslavement · b1e45526

Authored by Ido Schimmel on Apr 30, 2017

When a netdev is enslaved to a VRF master, its router interface (RIF)
needs to be destroyed (if exists) and a new one created using the
corresponding virtual router (VR).

From the driver's perspective, the above is equivalent to an inetaddr
event sent for this netdev. Therefore, when a port netdev (or one of its
uppers) is enslaved to a VRF master, call the same function that
would've been called had a NETDEV_UP been sent for this netdev in the
inetaddr notification chain.

This patch also fixes a bug when a LAG netdev with an existing RIF is
enslaved to a VRF. Before this patch, each LAG port would drop the
reference on the RIF, but would re-join the same one (in the wrong VR)
soon after. With this patch, the corresponding RIF is first destroyed
and a new one is created using the correct VR.

Fixes: 7179eb5a ("mlxsw: spectrum_router: Add support for VRFs")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Prevent warning without CONFIG_RFS_ACCEL · 07ff2ed0

Authored by Mintz, Yuval on Apr 30, 2017

After removing the PTP related initialization from slowpath start,
the remaining PTT entry is required only in case CONFIG_RFS_ACCEL is set.
Otherwise, it leads to a warning due to it being unused.

Fixes: d179bd16 ("qed: Acquire/release ptt_ptp lock when enabling/disabling PTP")
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: output the DPM status and WID count · 20b1bd96

Authored by Ram Amrani on Apr 30, 2017

Output to the RDMA driver whether DPM mode is enabled or disabled in
the HW and, if it is enabled, the number of WIDs it supports.

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: align DPI configuration to HW requirements · 107392b7

Authored by Ram Amrani on Apr 30, 2017

When calculating doorbell BAR partitioning round up the number of
CPUs to the nearest power of 2 so the size of the DPI (per user
section) configured in the hardware will be stored properly and
not truncated.

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
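
The rounding described in this commit message can be illustrated in userspace C. This is only a sketch: `round_up_pow2()` is a hypothetical stand-in for the kernel's `roundup_pow_of_two()`, and the log2-encoded size field is an assumption made for illustration, not a description of the actual qed register layout.

```c
#include <stdint.h>

/* Round n up to the nearest power of two.
 * Valid for 1 <= n <= 2^31; hypothetical userspace helper. */
static uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* floor(log2(n)) for n >= 1. */
static uint32_t floor_log2(uint32_t n)
{
    uint32_t l = 0;

    while (n >>= 1)
        l++;
    return l;
}

/* Value that survives a store/load through a log2-encoded field
 * (assumed encoding): anything that is not a power of two is truncated. */
static uint32_t log2_round_trip(uint32_t n)
{
    return (uint32_t)1 << floor_log2(n);
}
```

With 12 CPUs, for example, such a field would read back as 8 unless the count is first rounded up to 16.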

qed: verify RoCE resource bitmaps are released · e015d58b

Authored by Ram Amrani on Apr 30, 2017

Add mechanism to verify RoCE resources are released prior to freeing the
bitmaps. If this is not the case, print what resources were not released.

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: add error handling flow to TID deregistration posting failure · 10536194

Authored by Ram Amrani on Apr 30, 2017

If the posting of the ramrod for the purpose of TID deregistration
fails, abort the deregistration operation without using the FW's
return code.

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: remove unused SQ error state · ba0154e9

Authored by Ram Amrani on Apr 30, 2017

The internal RoCE SQE QP state isn't being used. Instead we mark the
QP as in regular error state.

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: configure the RoCE max message size · 793ea8a9

Authored by Ram Amrani on Apr 30, 2017

Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

benet: Use time_before_eq for time comparison · 2faf2657

Authored by Karim Eshapa on May 01, 2017

Use time_before_eq() for time comparison. It is safer and handles
timer wrapping, which makes the code future-proof.

Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
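
A minimal userspace sketch of why a wrap-safe comparison matters. `ticks_before_eq()` uses the same signed-difference arithmetic as the kernel's `time_before_eq()` macro; the function names are hypothetical and the `unsigned long` counter merely stands in for jiffies.

```c
#include <limits.h>
#include <stdbool.h>

/* True if tick value a is at or before b, even when the counter has
 * wrapped between the two. The unsigned subtraction wraps modulo 2^N,
 * and the conversion to long (two's complement on the usual compilers)
 * turns "b is slightly behind a" into a negative number. */
static bool ticks_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}

/* Naive comparison: gives the wrong answer once the counter wraps. */
static bool naive_before_eq(unsigned long a, unsigned long b)
{
    return a <= b;
}
```

For `now = ULONG_MAX - 5` and a deadline ten ticks later (which wraps around to 4), the naive test reports that the deadline has already passed, while the signed-difference test does not.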

virtio_net: make use of extended ack message reporting · 9861ce03

Authored by Jakub Kicinski on Apr 30, 2017

Try to carry error messages to the user via the netlink extended
ack message attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: make use of extended ack message reporting · d957c0f7

Authored by Jakub Kicinski on Apr 30, 2017

Try to carry error messages to the user via the netlink extended
ack message attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Allow BCM5481x PHYs to setup internal TX/RX clock delay · 73333626

Authored by Abhishek Shah on Apr 30, 2017

This patch allows users to enable/disable internal TX and/or RX
clock delay for BCM5481x series PHYs so as to satisfy RGMII timing
specifications.

On a particular platform, whether TX and/or RX clock delay is required
depends on how the PHY is connected to the MAC IP. This requirement can be
specified through the "phy-mode" property in the platform device tree.

Signed-off-by: Abhishek Shah <abhishek.shah@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sunhme: fix spelling mistakes: "ParityErro" -> "ParityError" · d8325650

Authored by Colin Ian King on Apr 29, 2017

Trivial fix to spelling mistakes in printk messages.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Align RX buffers · 9b70de6d

Authored by Scott Wood on Apr 28, 2017

The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.

Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.

This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:

Unable to handle kernel paging request for unaligned access at address 0xc00000ffc43af656
Faulting instruction address: 0xc00000000080ef8c
Oops: Kernel access of bad area, sig: 7 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in: vmx_crypto powernv_rng rng_core powernv_op_panel leds_powernv led_class nfsd ip_tables x_tables autofs4 xfs lpfc bnx2x mdio libcrc32c crc_t10dif crct10dif_generic crct10dif_common
CPU: 104 PID: 0 Comm: swapper/104 Not tainted 4.11.0-rc8-00088-g4c761daf #2
task: c00000ffd4892400 task.stack: c00000ffd4920000
NIP: c00000000080ef8c LR: c00000000080eee8 CTR: c0000000001f8320
REGS: c00000ffffc33710 TRAP: 0600   Not tainted  (4.11.0-rc8-00088-g4c761daf)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
  CR: 24082042  XER: 00000000
CFAR: c00000000080eea0 DAR: c00000ffc43af656 DSISR: 00000000 SOFTE: 1
GPR00: c000000000907f64 c00000ffffc33990 c000000000dd3b00 c00000ffcaf22100
GPR04: c00000ffcaf22e00 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000b80008 c00000ffc43af636 c00000ffc43af656 0000000000000000
GPR12: c0000000001f6f00 c00000000fe1a000 000000000000049f 000000000000c51f
GPR16: 00000000ffffef33 0000000000000000 0000000000008a43 0000000000000001
GPR20: c00000ffc58a90c0 0000000000000000 000000000000dd86 0000000000000000
GPR24: c000007fd0ed10c0 00000000ffffffff 0000000000000158 000000000000014a
GPR28: c00000ffc43af010 c00000ffc9144000 c00000ffcaf22e00 c00000ffcaf22100
NIP [c00000000080ef8c] __skb_clone+0xdc/0x140
LR [c00000000080eee8] __skb_clone+0x38/0x140
Call Trace:
[c00000ffffc33990] [c00000000080fb74] skb_clone+0x74/0x110 (unreliable)
[c00000ffffc339c0] [c000000000907f64] packet_rcv+0x144/0x510
[c00000ffffc33a40] [c000000000827b64] __netif_receive_skb_core+0x5b4/0xd80
[c00000ffffc33b00] [c00000000082b2bc] netif_receive_skb_internal+0x2c/0xc0
[c00000ffffc33b40] [c00000000082c49c] napi_gro_receive+0x11c/0x260
[c00000ffffc33b80] [d000000066483d68] bnx2x_poll+0xcf8/0x17b0 [bnx2x]
[c00000ffffc33d00] [c00000000082babc] net_rx_action+0x31c/0x480
[c00000ffffc33e10] [c0000000000d5a44] __do_softirq+0x164/0x3d0
[c00000ffffc33f00] [c0000000000d60a8] irq_exit+0x108/0x120
[c00000ffffc33f20] [c000000000015b98] __do_irq+0x98/0x200
[c00000ffffc33f90] [c000000000027f14] call_do_irq+0x14/0x24
[c00000ffd4923a90] [c000000000015d94] do_IRQ+0x94/0x110
[c00000ffd4923ae0] [c000000000008d90] hardware_interrupt_common+0x150/0x160

Signed-off-by: David S. Miller <davem@davemloft.net>
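
The effect of aligning the buffer size can be sketched in userspace C. The helpers below are hypothetical; `align_up()` uses the same arithmetic as the kernel's `ALIGN()` macro, and the model of buffers being carved back to back from an aligned region is a simplifying assumption, not the actual `page_frag_alloc()` implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Round size up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* True if v is a multiple of align (align must be a power of two). */
static bool is_aligned(size_t v, size_t align)
{
    return (v & (align - 1)) == 0;
}

/* Offset of the n-th buffer when buffers of one size are carved back to
 * back from an aligned region (assumed allocation model). */
static size_t nth_buffer_offset(size_t buf_size, size_t n)
{
    return buf_size * n;
}
```

With a hypothetical buffer size of 1450 bytes, the second buffer starts at offset 1450, which is not 8-byte aligned. After rounding the size up to 1472 (a multiple of 64), every buffer offset stays aligned.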

liquidio: silence a locking static checker warning · 77041e89

Authored by Dan Carpenter on Apr 28, 2017

Presumably we never hit this return, but static checkers complain that
we need to unlock so we may as well fix that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Unlock on error in qed_vf_pf_acquire() · 66117a9d

Authored by Dan Carpenter on Apr 28, 2017

My static checker complains that we're holding a mutex on this error
path.  Let's goto exit instead of returning directly.

Fixes: b0bccb69 ("qed: Change locking scheme for VF channel")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
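
The "goto exit instead of returning directly" pattern looks roughly like this. It is a generic sketch, not the code of qed_vf_pf_acquire(): `struct toy_lock` and `do_request()` are hypothetical, and the counter only stands in for a real mutex so that the lock state can be inspected.

```c
#include <stdbool.h>

/* Toy lock: held > 0 means the lock is currently taken. */
struct toy_lock {
    int held;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held++; }
static void toy_lock_release(struct toy_lock *l) { l->held--; }

/* Returns 0 on success or a negative value on error.
 * Every path leaves through the exit label, so the lock is always
 * released exactly once. */
static int do_request(struct toy_lock *l, bool fail_early)
{
    int rc = 0;

    toy_lock_acquire(l);

    if (fail_early) {
        rc = -1;
        goto exit; /* a bare "return -1;" here would leak the lock */
    }

    /* ... normal work under the lock would go here ... */

exit:
    toy_lock_release(l);
    return rc;
}
```

Keeping a single unlock site at the end of the function is what lets a static checker confirm that no error path returns with the lock held.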

net: hns: support deferred probe when no mdio · 804ffe5c

Authored by lipeng on Apr 28, 2017

In the hip06 and hip07 SoCs, the PHY connects to the MDIO bus. The MDIO
module is probed with module_init and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

We check for probe deferral in the MAC init, so that when there is no
MDIO we do not init the DSAF and free all resources, only to learn later
that we need to defer the probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: support deferred probe when can not obtain irq · 2fdd6baf

Authored by lipeng on Apr 28, 2017

In the hip06 and hip07 SoCs, the interrupt lines from the
DSAF controllers are connected to mbigen hw module.
The mbigen module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: provide 256 bytes of XDP headroom in all configurations · dbf637ff

Authored by Jakub Kicinski on Apr 27, 2017

For legacy reasons NFP FW may be compiled to DMA packets to a constant
offset into the buffer and use the space before it for metadata.  This
ensures that packet data always starts at a certain offset regardless of
the amount of preceding metadata.

If rx offset is set to 0 there may still be up to 64 bytes of metadata
but metadata will start at the beginning of the buffer, instead of:

    data_start_offset = rx_offset - meta_len

Even though we make the buffers larger to accommodate up to 64 bytes of
metadata, if there is only N bytes of metadata, we will end up with
N bytes of headroom and 64 - N bytes of tailroom.  Therefore we can't
rely on that space for XDP headroom.  Make sure we always allocate
the full 256 bytes.  This, unfortunately, means we can't fit the headroom
in a u8 any more.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
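
The arithmetic in this commit message can be checked with a small sketch. The helper names are hypothetical; the constants 64 and 256 come from the message itself, and the layout model (metadata at the very start of the buffer when the rx offset is 0) follows its description.

```c
#include <stdint.h>

#define MAX_META_LEN 64u  /* "up to 64 bytes of metadata" */
#define XDP_HEADROOM 256u /* "full 256 bytes" */

/* With rx offset 0 the metadata starts at the beginning of the buffer,
 * so only meta_len bytes precede the packet data. */
static unsigned headroom_at_rx_offset_0(unsigned meta_len)
{
    return meta_len;
}

/* The rest of the space reserved for metadata ends up behind the packet. */
static unsigned spare_tailroom_at_rx_offset_0(unsigned meta_len)
{
    return MAX_META_LEN - meta_len;
}

/* Storing a headroom value in 8-bit and 16-bit fields. */
static uint8_t as_u8(unsigned v) { return (uint8_t)v; }
static uint16_t as_u16(unsigned v) { return (uint16_t)v; }
```

A u8 holds at most 255, so a headroom of 256 wraps to 0 when stored in one, which is why a wider type is needed.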

nfp: don't completely refuse to work with old flashes · 85cb207e

Authored by Jakub Kicinski on Apr 27, 2017

Right now the required Service Process ABI version is still tied
to the max ID of known commands. For the new NSP commands we are adding,
we check whether the NSP version is recent enough on a command-by-command
basis. The driver doesn't have to force the device to have the
very latest flash; anything newer than 0.8 should do.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: avoid reading TX queue indexes from the device · d38df0d3

Authored by Jakub Kicinski on Apr 27, 2017

Reading TX queue indexes from the device memory on each interrupt
is expensive.  It's doubly expensive with XDP running since we have
two TX rings to check there.  If the software indexes indicate that
the TX queue is completely empty, however, we don't need to look at
the device completion index at all.

The queuing CPU does a wmb() before kicking the device TX, so
it should be safe to assume that the CPU handling the completions will
never see an old value of the software copy of the index.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
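
The shortcut described in this commit can be sketched as follows. This is not the nfp driver code: the struct, its field names, and `read_device_index()` are hypothetical, and a plain counter stands in for the expensive read from device memory. Memory barriers are left out of the single-threaded sketch.

```c
#include <stdint.h>

struct tx_ring_sketch {
    uint32_t wr_p;                  /* software producer index */
    uint32_t rd_p;                  /* software consumer index */
    uint32_t device_completion_idx; /* stand-in for the device-side index */
    unsigned device_reads;          /* how many expensive reads were done */
};

/* Stand-in for the expensive read of the device completion index. */
static uint32_t read_device_index(struct tx_ring_sketch *r)
{
    r->device_reads++;
    return r->device_completion_idx;
}

/* Returns the number of newly completed descriptors. */
static uint32_t tx_complete(struct tx_ring_sketch *r)
{
    uint32_t hw, done;

    /* Software indexes say nothing is in flight: skip the device read. */
    if (r->wr_p == r->rd_p)
        return 0;

    hw = read_device_index(r);
    done = hw - r->rd_p; /* unsigned arithmetic handles index wrap */
    r->rd_p = hw;
    return done;
}
```

On an idle ring the completion handler returns without touching the device at all, which is where the saving comes from.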

nfp: do simple XDP TX buffer recycling · 92e68195

Authored by Jakub Kicinski on Apr 27, 2017

On the RX path we follow the "drop if allocation of replacement
buffer fails" rule.  With XDP we extended that to the TX action,
so if XDP prog returned TX but allocation of replacement RX buffer
failed, we will drop the packet.

To improve our XDP TX performance extend the idea of rings being
always full to XDP TX rings.  Pre-fill the XDP TX rings with RX
buffers, and when XDP prog returns TX action swap the RX buffer
with the next buffer from the TX ring.

XDP TX complete will no longer free the buffers but let them
sit on the TX ring and wait for swap with RX buffer, instead.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
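
The buffer swap can be pictured with a toy ring. This is a sketch under simplifying assumptions, not the driver's implementation: the ring type and `xdp_tx_swap()` are hypothetical, and the sketch ignores TX completion tracking and the case where the TX ring has no completed slot available.

```c
#include <stddef.h>

#define RING_SIZE 4

struct xdp_tx_ring_sketch {
    void *bufs[RING_SIZE]; /* pre-filled with spare RX-sized buffers */
    unsigned next;         /* next slot to use for transmission */
};

/* Hand rx_buf to the TX ring for transmission and return the buffer
 * that was sitting in that slot, to be used as the RX replacement.
 * No allocation happens on this path. */
static void *xdp_tx_swap(struct xdp_tx_ring_sketch *tx, void *rx_buf)
{
    unsigned slot = tx->next % RING_SIZE;
    void *replacement = tx->bufs[slot];

    tx->bufs[slot] = rx_buf;
    tx->next++;
    return replacement;
}
```

Because the replacement always comes from the TX ring, the "drop if allocation of the replacement buffer fails" case no longer applies to the XDP TX action in this model.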

nfp: drop rx_ring param from buffer allocation · d78005a5

Authored by Jakub Kicinski on Apr 27, 2017

We will soon allocate RX buffers for caching on XDP TX rings.
The rx_ring parameter passed to nfp_net_rx_alloc_one() is not
actually used, remove it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: replace -ENOTSUPP with -EOPNOTSUPP · 46c50518

Authored by Jakub Kicinski on Apr 27, 2017

As Or points out in commit 423b3aec ("net/mlx4: Change ENOTSUPP
to EOPNOTSUPP"), ENOTSUPP is an NFS-specific error.  Replace it with
EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: use netif_tx_napi_add for tx napi · 1d11e732

Authored by Willem de Bruijn on Apr 27, 2017

Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

geneve: fix incorrect setting of UDP checksum flag · 5e0740c4

Authored by Girish Moodalbail on Apr 27, 2017

Creating a geneve link with 'udpcsum' set results in the creation of a link
for which the UDP checksum will NOT be computed on outbound packets, as can
be seen below.

11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
    geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum

Similarly, creating a link with 'noudpcsum' set results in the creation
of a link for which the UDP checksum will be computed on outbound packets.

Fixes: 9b4437a5 ("geneve: Unify LWT and netdev handling.")
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: do not output confusing error message · baf4d786

Authored by Jiri Benc on Apr 27, 2017

The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.

Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: correctly handle ipv6.disable module parameter · d074bf96

Authored by Jiri Benc on Apr 27, 2017

When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

Fixes: b1be00a6 ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Get rid of useless temporary variable · 90a1bb98

Authored by Andy Shevchenko on Apr 27, 2017

Replace pattern

 int status;
 ...
 status = func(...);
 return status;

by

 return func(...);

No functional change intended.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Reuse bnx2x_null_format_ver() · b77f0167

Authored by Andy Shevchenko on Apr 27, 2017

Reuse bnx2x_null_format_ver() in functions where it's appropriate
instead of the open-coded variant.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Replace custom scnprintf() · 55b218c1

Authored by Andy Shevchenko on Apr 27, 2017

Use scnprintf() when printing version instead of custom open coded variants.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: fix phy interrupt parsing · ae3696c1

Authored by Alexandre Belloni on Apr 26, 2017

Since 83a77e9e, the phydev irq is explicitly set to PHY_POLL when
there is no pdata. It doesn't work on DT enabled platforms because the
phydev irq is already set by libphy before.

Fixes: 83a77e9e ("net: macb: Added PCI wrapper for Platform Driver.")
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Apr, 2017 8 commits

net/mlx5: E-Switch, Avoid redundant memory allocation · 0a0ab1d2

Authored by Eli Cohen on Feb 28, 2017

struct esw_mc_addr is a small struct that can be part of struct
mlx5_eswitch. Define it as a field and not as a pointer and save the
kzalloc call and its error flow handling.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Disable HW LRO when PCI is slower than link on striding RQ · 0f6e4cf6

Authored by Eran Ben Elisha on Apr 26, 2017

We will activate HW LRO only on servers with PCI BW > MAX LINK BW,
or when PCI BW > 16Gbps. In other cases we do not want LRO by default, as
LRO sessions might time out and add redundant software overhead.

Tested:
	ethtool -k <ifs-name> | grep large-receive-offload
	On systems with and without the limitations.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
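
Read literally, the rule stated in this commit message amounts to the predicate below. It is a sketch of the wording only, not the driver's actual heuristic: the function name is hypothetical and expressing bandwidth in Mbps is an assumption.

```c
#include <stdbool.h>

#define LRO_PCI_BW_THRESHOLD_MBPS 16000u /* "PCI BW > 16Gbps" */

/* True if HW LRO should be on by default, per the commit message:
 * PCI bandwidth exceeds the maximum link bandwidth, or exceeds 16 Gbps. */
static bool hw_lro_default_on(unsigned pci_bw_mbps, unsigned max_link_bw_mbps)
{
    return pci_bw_mbps > max_link_bw_mbps ||
           pci_bw_mbps > LRO_PCI_BW_THRESHOLD_MBPS;
}
```

For instance, a hypothetical 8 Gbps PCI path feeding a 10 Gbps link leaves LRO off by default, while the same PCI path feeding a 1 Gbps link keeps it on.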

net/mlx5e: Use u8 as ownership type in mlx5e_get_cqe() · b1b03bde

Authored by Tariq Toukan on Apr 03, 2017

CQE ownership indication is as small as a single bit.
Use u8 to speed up the comparison.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Use prefetchw when a write is to follow · ad78af9b

Authored by Tariq Toukan on Feb 15, 2017

"prefetchw()" prefetches the cacheline for write. Use it for
skb->data, as soon we'll be copying the packet header there.

Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for smaller packets, as less time
is spent on handling SKB fragments, making the path shorter
and the improvement more significant.

---------------------------------------------
packet size | before    | after     | gain  |
64B         | 4,113,306 | 4,778,720 |  16%  |
1024B       | 3,633,819 | 3,950,593 | 8.7%  |

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Optimize poll ICOSQ completion queue · 1f5b1e47

Authored by Tariq Toukan on Mar 29, 2017

UMR operations are more frequent and important.
Check them first, and add a compiler branch predictor hint.

According to current design, ICOSQ CQ can contain at most one
pending CQE per napi. Poll function is optimized accordingly.

Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for larger packet sizes, as BW is higher
and UMR posts are more frequent.

---------------------------------------------
packet size | before    | after     | gain  |
64B         | 4,092,370 | 4,113,306 |  0.5% |
1024B       | 3,421,435 | 3,633,819 |  6.2% |

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Act on delay probe time updates · a2fa1fe5

Authored by Hadar Hen Zion on Feb 14, 2017

The user can change the delay_first_probe_time parameter through sysctl.
Listen to NETEVENT_DELAY_PROBE_TIME_UPDATE notifications and update the
intervals of the periodic task that updates the neighbours' 'used' value
and of the periodic task that queries the flow HW counters.
Both intervals will be updated only if the new delay probe
time value is lower than the current interval.

Since the driver saves only one minimum interval value, and not one per
device, users will be able to set a lower interval for the periodic task
that updates the neighbours' 'used' value, but they won't be able to
schedule a higher interval for it.
The interval used for scheduling this periodic task is
the minimal delay probe time parameter ever seen by the driver.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
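
The "minimum ever seen" behaviour can be expressed in a few lines. This is a sketch of the rule as the commit message states it; the function name and the time unit are hypothetical.

```c
/* Keep the smallest delay probe time seen so far. The stored interval
 * can only move downwards: a larger new value is ignored. */
static unsigned long update_min_interval(unsigned long current_min,
                                         unsigned long new_delay_probe_time)
{
    return new_delay_probe_time < current_min ? new_delay_probe_time
                                              : current_min;
}
```

Starting from 5000 and receiving 3000 lowers the interval to 3000; a later update to 8000 leaves it at 3000, which is the limitation the message describes.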

net/mlx5e: Update neighbour 'used' state using HW flow rules counters · f6dfb4c3

Authored by Hadar Hen Zion on Feb 24, 2017

When IP tunnel encapsulation rules are offloaded, the kernel can't see
the traffic of the offloaded flow. The neighbour for the IP tunnel
destination of the offloaded flow can mistakenly become STALE and be
deleted by the kernel since its 'used' value wasn't changed.

To make sure that a neighbour which is used by the HW won't become
STALE, we proactively update the neighbour 'used' value every
DELAY_PROBE_TIME period, when packets were matched and counted by the HW
for one of the tunnel encap flows related to this neighbour.

The periodic task that updates the used neighbours is scheduled when a
tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
itself as long as the representor's neighbours list isn't empty.

Add, remove, lookup and status change operations done over the
representor's neighbours list or the neighbour hash entry encaps list
are all serialized by RTNL lock.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Add support to neighbour update flow · 232c0013

Authored by Hadar Hen Zion on Mar 20, 2017

In order to offload TC encap rules, the driver does a lookup for the IP
tunnel neighbour according to the output device and the destination IP
given by the user.

To keep track of the validity state of such neighbours, we keep
the neighbours' information (a pair of device pointer and destination IP)
in a hash table maintained at the relevant egress representor and
register to get NETEVENT_NEIGH_UPDATE events. When getting a neighbour update
netevent, we search for a match among the cached neighbour entries used for
encapsulation.

In case the neighbour isn't valid, we can't offload the flow into the
HW. We cache the flow (requested matching and actions) in the driver and
offload the rule later, when the neighbour is resolved and becomes
valid.

When a flow is only cached in the driver and not offloaded into HW
yet, we use EAGAIN return value to mark it internally, the TC ndo still
returns success.

Listen to kernel neighbour update netevents to track the validity state
of the relevant neighbours:

1. If a neighbour becomes valid, offload the related rules to HW.

2. If the neighbour becomes invalid, remove the related rules from HW.

3. If the neighbour mac address was changed, update the encap header.
   Remove all the offloaded rules using the old encap header from the HW
   and insert new rules to HW with updated encap header.

Access to the neighbors hash table is protected by RTNL lock of its
caller or by the table's spinlock.

Details of the locking/synchronization among the different actions
applied on the neighbour table:

Add/remove operations - protected by RTNL lock of its caller (all TC
commands are protected by RTNL lock). Add and remove operations are
initiated only when the user inserts/removes a TC rule into/from the driver.

Lookup/remove operations - since the lookup operation is done from
netevent notifier block, RTNL lock can't be used (atomic context).
Use the table's spin lock to protect lookups from TC user removal operation.
bh is used since netevent can be called from a softirq context.

Lookup/add operations - The hash table access functions are taking
care of the protection between lookup and add operations.

When adding/removing encap headers and rules to/from the HW, RTNL lock
is used. It can happen when:

1. The user inserts/removes a TC rule into/from the driver (TC commands
are protected by RTNL lock of its caller).

2. The driver gets neighbour notification event, which reports about
neighbour validity status change. Before adding/removing encap headers
and rules to/from the HW, RTNL lock is taken.

A neighbour hash table entry should be freed when its encap list is empty.
Since the neighbour update netevent notification schedules a neighbour
update work that uses the neighbour hash entry, it can't be freed
unconditionally when the encap list becomes empty during TC delete rule flow.
Use a reference count to protect against freeing a neighbour hash table entry
while it's still in use.

When the user asks to unregister a netdevice used by one of the neighbours,
a neighbour removal notification is received. Then we take a reference on the
neighbour and don't free it until the relevant encap entries (and flows) are
marked as invalid (not offloaded) and removed from HW.
As long as the encap entry is still valid (checked under RTNL lock) we
can safely access the neighbour device saved on mlx5e_neigh struct.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
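
The reference-count protection described in this commit can be sketched with a toy entry. This is not the mlx5e code: the struct and helpers are hypothetical, the counter is a plain int (real kernel code would need an atomic reference count), and "freed" is only a flag so the outcome can be inspected.

```c
#include <stdbool.h>

struct neigh_entry_sketch {
    int refcnt; /* number of users currently holding the entry */
    bool freed; /* set when the last reference is dropped */
};

/* Take a reference, e.g. before scheduling the neighbour update work. */
static void entry_hold(struct neigh_entry_sketch *e)
{
    e->refcnt++;
}

/* Drop a reference; the entry is released only when the last user is
 * done with it. Real code would free the memory here. */
static void entry_put(struct neigh_entry_sketch *e)
{
    if (--e->refcnt == 0)
        e->freed = true;
}
```

If the TC delete flow drops its reference while update work is still pending, the entry survives until that work drops its own reference.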
