提交 · 4fdd2dbc7cad2863c8c44793a383182f75442dfb · openeuler / Kernel

27 5月, 2020 40 次提交

Merge branch 'mlxsw-Various-trap-changes-part-2' · 4fdd2dbc

由 David S. Miller 提交于 5月 26, 2020

Ido Schimmel says:

====================
mlxsw: Various trap changes - part 2

This patch set contains another set of small changes in mlxsw trap
configuration. It is the last set before exposing control traps (e.g.,
IGMP query, ARP request) via devlink-trap.

Tested with existing devlink-trap selftests. Please see individual
patches for a detailed changelog.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4fdd2dbc

mlxsw: spectrum_router: Allow programming link-local prefix routes · 10d3757f

由 Ido Schimmel 提交于 5月 26, 2020

The device has a trap for IPv6 packets that need be routed and have a
unicast link-local destination IP (i.e., fe80::/10). This allows mlxsw
to ignore link-local routes, as the packets will be trapped to the CPU
in any case.

However, since link-local routes are not programmed, it is possible for
routed packets to hit the default route which might also be programmed
to trap packets. This means that packets with a link-local destination
IP might be trapped for the wrong reason.

To overcome this, allow programming link-local prefix routes (usually
one fe80::/64 per-table), so that the packets will be forwarded until
reaching the link-local trap.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10d3757f

mlxsw: spectrum: Add packet traps for BFD packets · 9785b92b

由 Ido Schimmel 提交于 5月 26, 2020

Bidirectional Forwarding Detection (BFD) provides "low-overhead,
short-duration detection of failures in the path between adjacent
forwarding engines" (RFC 5880).

This is accomplished by exchanging BFD packets between the two
forwarding engines. Up until now these packets were trapped via the
general local delivery (i.e., IP2ME) trap which also traps a lot of
other packets that are not as time-sensitive as BFD packets.

Expose dedicated traps for BFD packets so that user space could
configure a dedicated policer for them.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9785b92b

mlxsw: spectrum: Treat IPv6 link-local SIP as an exception · dacc4e3a

由 Ido Schimmel 提交于 5月 26, 2020

IPv6 packets that need to be forwarded and have a link-local source IP are
dropped by the kernel and an ICMPv6 "Destination unreachable" is sent to
the sending host.

As such, change the trap group of such packets so that they do not
interfere with IPv6 management packets. In the future this trap will be
exposed as an exception via devlink-trap.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dacc4e3a

mlxsw: spectrum: Share one group for all locally delivered packets · 1260e083

由 Ido Schimmel 提交于 5月 26, 2020

Routed IP packets with the Router Alert option need to be trapped to
the CPU as they might need to be locally delivered to raw sockets with
the IP_ROUTER_ALERT / IPV6_ROUTER_ALERT socket option.

Move them to the same group with other packets that might need to be
trapped following route lookup.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1260e083

mlxsw: reg: Move all trap groups under the same enum · 500769be

由 Ido Schimmel 提交于 5月 26, 2020

After the previous patch the split is no longer necessary and all the
trap groups can be moved under the same enum.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

500769be

mlxsw: spectrum_trap: Do not hard code "thin" policer identifier · b87bde80

由 Ido Schimmel 提交于 5月 26, 2020

As explained in commit e6125230 ("mlxsw: spectrum_trap: Introduce
dummy group with thin policer"), the purpose of the "thin" policer is to
pass as less packets as possible to the CPU.

The identifier of this policer is currently set according to the maximum
number of used trap groups, but this is fragile: On Spectrum-1 the
maximum number of policers is less than the maximum number of trap
groups, which might result in an invalid policer identifier in case the
number of used trap groups grows beyond the policer limit.

Solve this by dynamically allocating the policer identifier.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b87bde80

mlxsw: switchx2: Move SwitchX-2 trap groups out of main enum · 03cb0ce0

由 Ido Schimmel 提交于 5月 26, 2020

The number of Spectrum trap groups is not infinite, but two identifiers
are occupied by SwitchX-2 specific trap groups. Free these identifiers
by moving them out of the main enum.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03cb0ce0

mlxsw: spectrum: Reduce priority of locally delivered packets · 025b7de7

由 Ido Schimmel 提交于 5月 26, 2020

To align with recent recommended values. Will be configurable by future
patches.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

025b7de7

mlxsw: spectrum: Use same trap group for local routes and link-local destination · 1e3cd589

由 Ido Schimmel 提交于 5月 26, 2020

Packets with an IPv6 link-local destination (i.e., fe80::/10) should not
be forwarded and are therefore trapped to the CPU for local delivery.
Since these packets are trapped for the same logical reason as packets
hitting local routes, associate both traps with the same group.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e3cd589

mlxsw: spectrum: Use separate trap group for FID miss · d322309d

由 Ido Schimmel 提交于 5月 26, 2020

When a packet enters the device it is classified to a filtering
identifier (FID) based on the ingress port and VLAN. The FID miss trap
is used to trap packets for which a FID could not be found.

In mlxsw this trap should only be triggered when a port is enslaved to
an OVS bridge and a matching ACL rule could not be found, so as to
trigger learning.

These packets are therefore completely unrelated to packets hitting
local routes and should be in a different group. Move them.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d322309d

mlxsw: spectrum: Use same trap group for various IPv6 packets · 954eef26

由 Ido Schimmel 提交于 5月 26, 2020

Group these various IPv6 packets (e.g., router solicitations, router
advertisement) together and subject them to the same policer.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

954eef26

mlxsw: spectrum: Rename IPv6 ND trap group · 412df3d1

由 Ido Schimmel 提交于 5月 26, 2020

The IPv6 Neighbour Discovery (ND) group will be used for various IPv6
packets, not all of which fall under the definition of ND, so rename it
to "IPV6" which is more appropriate.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

412df3d1

mlxsw: spectrum: Use same switch case for identical groups · 761bc42f

由 Ido Schimmel 提交于 5月 26, 2020

Trap groups that use the same policer settings can share the same switch
case.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

761bc42f

mlxsw: spectrum: Use dedicated trap group for ACL trap · 3c2d8a04

由 Ido Schimmel 提交于 5月 26, 2020

Packets that are trapped via tc's trap action are currently subject to
the same policer as packets hitting local routes. The latter are
critical to the correct functioning of the control plane, while the
former are mainly used for traffic inspection.

Split the ACL trap to a separate group with its own policer. Use a
higher priority for these traps than for traps using mirror action
(e.g., ARP, IGMP). Otherwise, packets matching both traps will not be
forwarded in hardware (because of trap action) and also not forwarded in
software because they will be marked with 'offload_fwd_mark'.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c2d8a04

mptcp: attempt coalescing when moving skbs to mptcp rx queue · 4e637c70

由 Florian Westphal 提交于 5月 25, 2020

We can try to coalesce skbs we take from the subflows rx queue with the
tail of the mptcp rx queue.

If successful, the skb head can be discarded early.

We can also free the skb extensions, we do not access them after this.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e637c70

r8169: improve rtl_remove_one · 12b1bc75

由 Heiner Kallweit 提交于 5月 25, 2020

Don't call netif_napi_del() manually, free_netdev() does this for us.
In addition reorder calls to match reverse order of calls in probe().
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

12b1bc75

Merge branch 'net-ethernet-fec-move-GPR-register-offset-and-bit-into-DT' · 394fe485

由 David S. Miller 提交于 5月 26, 2020

Fugang Duan says:

====================
net: ethernet: fec: move GPR register offset and bit into DT

The commit da722186 (net: fec: set GPR bit on suspend by
DT configuration) set the GPR reigster offset and bit in driver
for wol feature support.

It brings trouble to enable wol feature on imx6sx/imx6ul/imx7d
platforms that have multiple ethernet instances with different
GPR bit for stop mode control. So the patch set is to move GPR
register offset and bit define into DT, and enable imx6q/imx6dl
imx6qp/imx6sx/imx6ul/imx7d stop mode support.

Currently, below NXP i.MX boards support wol:
- imx6q/imx6dl/imx6qp sabresd
- imx6sx sabreauto
- imx7d sdb

imx6q/imx6dl/imx6qp sabresd board dts file miss the property
"fsl,magic-packet;", so patch#4 is to add the property for stop
mode support.

v1 -> v2:
 - driver: switch back to store the quirks bitmask in driver_data
 - dt-bindings: rename 'gpr' property string to 'fsl,stop-mode'
 - imx6/7 dtsi: add imx6sx/imx6ul/imx7d ethernet stop mode property
v2 -> v3:
 - driver: suggested by Sascha Hauer, use a struct fec_devinfo for
   abstracting differences between different hardware variants,
   it can give more freedom to describe the differences.
 - imx6/7 dtsi: correct one typo pointed out by Andrew.

Thanks Martin, Andrew and Sascha Hauer for the review.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

394fe485

ARM: dts: imx6qdl-sabresd: enable fec wake-on-lan · f099b8b7

由 Fugang Duan 提交于 5月 26, 2020

Enable ethernet wake-on-lan feature for imx6q/dl/qp sabresd
boards since the PHY clock is supplied by external osc.
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NFugang Duan <fugang.duan@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f099b8b7

ARM: dts: imx: add ethernet stop mode property · d009a621

由 Fugang Duan 提交于 5月 26, 2020

- Update the imx6qdl gpr property to define gpr register
  offset and bit in DT.
- Add imx6sx/imx6ul/imx7d ethernet stop mode property.
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NFugang Duan <fugang.duan@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d009a621

dt-bindings: fec: update the gpr property · 998ec26b

由 Fugang Duan 提交于 5月 26, 2020

- rename the 'gpr' property string to 'fsl,stop-mode'.
- Update the property to define gpr register offset and
bit in DT, since different instance have different gpr bit.

v2:
 * rename 'gpr' property string to 'fsl,stop-mode'.
Signed-off-by: NFugang Duan <fugang.duan@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

998ec26b

net: ethernet: fec: move GPR register offset and bit into DT · 8a448bf8

由 Fugang Duan 提交于 5月 26, 2020

The commit da722186 (net: fec: set GPR bit on suspend by DT
configuration) set the GPR reigster offset and bit in driver for
wake on lan feature.

But it introduces two issues here:
- one SOC has two instances, they have different bit
- different SOCs may have different offset and bit

So to support wake-on-lan feature on other i.MX platforms, it should
configure the GPR reigster offset and bit from DT.

So the patch is to improve the commit da722186 (net: fec: set GPR
bit on suspend by DT configuration) to support multiple ethernet
instances on i.MX series.

v2:
 * switch back to store the quirks bitmask in driver_data
v3:
 * suggested by Sascha Hauer, use a struct fec_devinfo for
   abstracting differences between different hardware variants,
   it can give more freedom to describe the differences.
Signed-off-by: NFugang Duan <fugang.duan@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a448bf8

net/smc: mark smc_pnet_policy as const · 09d0310f

由 Dmitry Vyukov 提交于 5月 25, 2020

Netlink policies are generally declared as const.
This is safer and prevents potential bugs.
Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

09d0310f

Merge tag 'mac80211-next-for-net-next-2020-04-25' of... · 745bd6f4

由 David S. Miller 提交于 5月 26, 2020

Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
One batch of changes, containing:
 * hwsim improvements from Jouni and myself, to be able to
   test more scenarios easily
 * some more HE (802.11ax) support
 * some initial S1G (sub 1 GHz) work for fractional MHz channels
 * some (action) frame registration updates to help DPP support
 * along with other various improvements/fixes
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

745bd6f4

Merge branch 'net-phy-mscc-miim-reduce-waiting-time-between-MDIO-transactions' · 0e348119

由 David S. Miller 提交于 5月 26, 2020

Antoine Tenart says:

====================
net: phy: mscc-miim: reduce waiting time between MDIO transactions

This series aims at reducing the waiting time between MDIO transactions
when using the MSCC MIIM MDIO controller.

I'm not sure we need patch 4/4 and we could reasonably drop it from the
series. I'm including the patch as it could help to ensure the system
is functional with a non optimal configuration.

We needed to improve the driver's performances as when using a PHY
requiring lots of registers accesses (such as the VSC85xx family),
delays would add up and ended up to be quite large which would cause
issues such as: a slow initialization of the PHY, and issues when using
timestamping operations (this feature will be sent quite soon to the
mailing lists).
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e348119

net: phy: mscc-miim: read poll when high resolution timers are disabled · a021ada2

由 Antoine Tenart 提交于 5月 26, 2020

The driver uses a read polling mechanism to check the status of the MDIO
bus, to know if it is ready to accept next commands. This polling
mechanism uses usleep_delay() under the hood between reads which is fine
as long as high resolution timers are enabled. Otherwise the delays will
end up to be much longer than expected.

This patch fixes this by using udelay() under the hood when
CONFIG_HIGH_RES_TIMERS isn't enabled. This increases CPU usage.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a021ada2

net: phy: mscc-miim: improve waiting logic · d9c6de35

由 Antoine Tenart 提交于 5月 26, 2020

The MSCC MIIM MDIO driver uses a waiting logic to wait for the MDIO bus
to be ready to accept next commands. It does so by polling the BUSY
status bit which indicates the MDIO bus has completed all pending
operations. This can take time, and the controller supports writing the
next command as soon as there are no pending commands (which happens
while the MDIO bus is busy completing its current command).

This patch implements this improved logic by adding an helper to poll
the PENDING status bit, and by adjusting where we should wait for the
bus to not be busy or to not be pending.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d9c6de35

net: phy: mscc-miim: remove redundant timeout check · f5112c8a

由 Antoine Tenart 提交于 5月 26, 2020

readl_poll_timeout already returns -ETIMEDOUT if the condition isn't
satisfied, there's no need to check again the condition after calling
it. Remove the redundant timeout check.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5112c8a

net: phy: mscc-miim: use more reasonable delays · 9513167e

由 Antoine Tenart 提交于 5月 26, 2020

The MSCC MIIM MDIO driver uses delays to read poll a status register. I
made multiple tests on a Ocelot PCS120 platform which led me to reduce
those delays. The delay in between which the polling function is allowed
to sleep is reduced from 100us to 50us which in almost all cases is a
good value to succeed at the first retry. The overall delay is also
lowered as the prior value was really way to high, 10000us is large
enough.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9513167e

net: mdiobus: add clause 45 mdiobus accessors · 90ce665c

由 Russell King 提交于 5月 26, 2020

There is a recurring pattern throughout some of the PHY code converting
a devad and regnum to our packed clause 45 representation. Rather than
having this scattered around the code, let's put a common translation
function in mdio.h, and provide some register accessors.

Convert the phylib core, phylink, bcm87xx and cortina to use these.
Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

90ce665c

Merge branch 'flow-mpls' · 8928e19a

由 David S. Miller 提交于 5月 26, 2020

Guillaume Nault says:

====================
flow_dissector, cls_flower: Add support for multiple MPLS Label Stack Entries

Currently, the flow dissector and the Flower classifier can only handle
the first entry of an MPLS label stack. This patch series generalises
the code to allow parsing and matching the Label Stack Entries that
follow.

Patch 1 extends the flow dissector to parse MPLS LSEs until the Bottom
Of Stack bit is reached. The number of parsed LSEs is capped at
FLOW_DIS_MPLS_MAX (arbitrarily set to 7). Flower and the NFP driver
are updated to take into account the new layout of struct
flow_dissector_key_mpls.

Patch 2 extends Flower. It defines new netlink attributes, which are
independent from the previous MPLS ones. Mixing the old and the new
attributes in a same filter is not allowed. For backward compatibility,
the old attributes are used when dumping filters that don't require the
new ones.

Changes since v2:
  * Fix compilation with the new MLX5 bareudp tunnel code.

Changes since v1:
  * Fix compilation of NFP driver (kbuild test robot).
  * Fix sparse warning with entropy label (kbuild test robot).
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8928e19a

cls_flower: Support filtering on multiple MPLS Label Stack Entries · 61aec25a

由 Guillaume Nault 提交于 5月 26, 2020

With struct flow_dissector_key_mpls now recording the first
FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of
these LSEs independently.

In order to avoid creating new netlink attributes for every possible
depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute
that contains the list of LSEs to match. Each LSE is represented by
another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains
the attributes representing the depth and the MPLS fields to match at
this depth (label, TTL, etc.).

For each MPLS field, the mask is always set to all-ones, as this is
what the original API did. We could allow user configurable masks in
the future if there is demand for more flexibility.

The new API also allows to only specify an LSE depth. In that case,
Flower only verifies that the MPLS label stack depth is greater or
equal to the provided depth (that is, an LSE exists at this depth).

Filters that only match on one (or more) fields of the first LSE are
dumped using the old netlink attributes, to avoid confusing user space
programs that don't understand the new API.
Signed-off-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61aec25a

flow_dissector: Parse multiple MPLS Label Stack Entries · 58cff782

由 Guillaume Nault 提交于 5月 26, 2020

The current MPLS dissector only parses the first MPLS Label Stack
Entry (second LSE can be parsed too, but only to set a key_id).

This patch adds the possibility to parse several LSEs by making
__skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long
as the Bottom Of Stack bit hasn't been seen, up to a maximum of
FLOW_DIS_MPLS_MAX entries.

FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for
many practical purposes, without wasting too much space.

To record the parsed values, flow_dissector_key_mpls is modified to
store an array of stack entries, instead of just the values of the
first one. A bit field, "used_lses", is also added to keep track of
the LSEs that have been set. The objective is to avoid defining a
new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack.

TC flower is adapted for the new struct flow_dissector_key_mpls layout.
Matching on several MPLS Label Stack Entries will be added in the next
patch.

The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and
mlx5's parse_tunnel() now verify that the rule only uses the first LSE
and fail if it doesn't.

Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is
slightly modified. Instead of recording the first Entropy Label, it
now records the last one. This shouldn't have any consequences since
there doesn't seem to have any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY
in the tree. We'd probably better do a hash of all parsed MPLS labels
instead (excluding reserved labels) anyway. That'd give better entropy
and would probably also simplify the code. But that's not the purpose
of this patch, so I'm keeping that as a future possible improvement.
Signed-off-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58cff782

Merge tag 'batadv-next-for-davem-20200526' of git://git.open-mesh.org/linux-merge · fb8ddaa9

由 David S. Miller 提交于 5月 26, 2020

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - Fix revert dynamic lockdep key changes for batman-adv,
   by Sven Eckelmann

 - use rcu_replace_pointer() where appropriate, by Antonio Quartulli

 - Revert "disable ethtool link speed detection when auto negotiation
   off", by Sven Eckelmann
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb8ddaa9

Merge branch 'tipc-add-some-improvements' · 6a862a44

由 David S. Miller 提交于 5月 26, 2020

Tuong Lien says:

====================
tipc: add some improvements

This series adds some improvements to TIPC.

The first patch improves the TIPC broadcast's performance with the 'Gap
ACK blocks' mechanism similar to unicast before, while the others give
support on tracing & statistics for broadcast links, and an alternative
to carry broadcast retransmissions via unicast which might be useful in
some cases.

Besides, the Nagle algorithm can now automatically 'adjust' itself
depending on the specific network condition a stream connection runs by
the last patch.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6a862a44

tipc: add test for Nagle algorithm effectiveness · 0a3e060f

由 Tuong Lien 提交于 5月 26, 2020

When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.

In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.

In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.

When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.

Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.

This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.

In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.

Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: NYing Xue <ying.xue@windriver.com>
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0a3e060f

tipc: add support for broadcast rcv stats dumping · 03b6fefd

由 Tuong Lien 提交于 5月 26, 2020

This commit enables dumping the statistics of a broadcast-receiver link
like the traditional 'broadcast-link' one (which is for broadcast-
sender). The link dumping can be triggered via netlink (e.g. the
iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
indicator.

The name of a broadcast-receiver link of a specific peer will be in the
format: 'broadcast-link:<peer-id>'.

For example:

Link <broadcast-link:1001002>
  Window:50 packets
  RX packets:7841 fragments:2408/440 bundles:0/0
  TX packets:0 fragments:0/0 bundles:0/0
  RX naks:0 defs:124 dups:0
  TX naks:21 acks:0 retrans:0
  Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the
usual way via netlink by specifying that link name in command.

Note: the 'tipc_link_name_ext()' is removed because the link name can
now be retrieved simply via the 'l->name'.
Acked-by: NYing Xue <ying.xue@windriver.com>
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03b6fefd

tipc: enable broadcast retrans via unicast · a91d55d1

由 Tuong Lien 提交于 5月 26, 2020

In some environment, broadcast traffic is suppressed at high rate (i.e.
a kind of bandwidth limit setting). When it is applied, TIPC broadcast
can still run successfully. However, when it comes to a high load, some
packets will be dropped first and TIPC tries to retransmit them but the
packet retransmission is intentionally broadcast too, so making things
worse and not helpful at all.

This commit enables the broadcast retransmission via unicast which only
retransmits packets to the specific peer that has really reported a gap
i.e. not broadcasting to all nodes in the cluster, so will prevent from
being suppressed, and also reduce some overheads on the other peers due
to duplicates, finally improve the overall TIPC broadcast performance.

Note: the functionality can be turned on/off via the sysctl file:

echo 1 > /proc/sys/net/tipc/bc_retruni
echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual.
Acked-by: NYing Xue <ying.xue@windriver.com>
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a91d55d1

tipc: add back link trace events · c6ed7a5c

由 Tuong Lien 提交于 5月 26, 2020

In the previous commit ("tipc: add Gap ACK blocks support for broadcast
link"), we have removed the following link trace events due to the code
changes:

- tipc_link_bc_ack
- tipc_link_retrans

This commit adds them back along with some minor changes to adapt to
the new code.
Acked-by: NYing Xue <ying.xue@windriver.com>
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c6ed7a5c

tipc: introduce Gap ACK blocks for broadcast link · d7626b5a

由 Tuong Lien 提交于 5月 26, 2020

As achieved through commit 9195948f ("tipc: improve TIPC throughput
by Gap ACK blocks"), we apply the same mechanism for the broadcast link
as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
consist of two parts built for both the broadcast and unicast types:

 31                       16 15                        0
+-------------+-------------+-------------+-------------+
|  bgack_cnt  |  ugack_cnt  |            len            |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > bc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > uc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible.

We also increase the max number of Gap ACK blocks to 128, allowing upto
64 blocks per type (total buffer size = 516 bytes).

Besides, the 'tipc_link_advance_transmq()' function is refactored which
is applicable for both the unicast and broadcast cases now, so some old
functions can be removed and the code is optimized.

With the patch, TIPC broadcast is more robust regardless of packet loss
or disorder, latency, ... in the underlying network. Its performance is
boost up significantly.
For example, experiment with a 5% packet loss rate results:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    0m 42.46s
user    0m 1.16s
sys     0m 17.67s

Without the patch:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    8m 27.94s
user    0m 0.55s
sys     0m 2.38s
Acked-by: NYing Xue <ying.xue@windriver.com>
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d7626b5a

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功