Commits · 0da1a1c4891188f456c7790940e47c8043bc7c9b · openeuler / Kernel

21 Oct, 2021 4 commits

net: mscc: ocelot: allow a config where all bridge VLANs are egress-untagged · 0da1a1c4

Committed by Vladimir Oltean on Oct 20, 2021

At present, the ocelot driver accepts a single egress-untagged bridge
VLAN, meaning that this sequence of operations:

ip link add br0 type bridge vlan_filtering 1
ip link set swp0 master br0
bridge vlan add dev swp0 vid 2 pvid untagged

fails because the bridge automatically installs VID 1 as a pvid & untagged
VLAN, and vid 2 would be the second untagged VLAN on this port. It is
necessary to delete VID 1 before proceeding to add VID 2.

This limitation comes from the fact that we operate the port tag, when
it has an egress-untagged VID, in the OCELOT_PORT_TAG_NATIVE mode.
The ocelot switches do not have full flexibility and can either have one
single VID as egress-untagged, or all of them.

There are use cases for having all VLANs as egress-untagged as well, and
this patch adds support for that.

The change rewrites ocelot_port_set_native_vlan() into a more generic
ocelot_port_manage_port_tag() function. Because the software bridge's
state, transmitted to us via switchdev, can become very complex, we
don't attempt to track all possible state transitions, but instead take
a more declarative approach and just make ocelot_port_manage_port_tag()
figure out which mode to operate in:

- port is VLAN-unaware: the classified VLAN (internal, unrelated to the
                        802.1Q header) is not inserted into packets on egress
- port is VLAN-aware:
  - port has tagged VLANs:
    -> port has no untagged VLAN: set up as pure trunk
    -> port has one untagged VLAN: set up as trunk port + native VLAN
    -> port has more than one untagged VLAN: this is an invalid config
       which is rejected by ocelot_vlan_prepare
  - port has no tagged VLANs
    -> set up as pure egress-untagged port

We don't keep the number of tagged and untagged VLANs, we just count the
structures we keep.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
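
The decision tree in the commit above can be read as a pure function of three inputs. The following is a minimal sketch of that reading, not the driver's code: the enum, its values, and pick_port_tag_mode() are hypothetical names, and the VLAN counts are passed in directly rather than derived by walking the driver's VLAN structures.

```c
#include <stdbool.h>

/* Hypothetical names; the driver's real enum and helper differ. */
enum port_tag_mode {
	PORT_TAG_DISABLED,	/* no VLAN tag is pushed on egress */
	PORT_TAG_NATIVE,	/* tag everything except the one native VID */
	PORT_TAG_TRUNK,		/* tag every VLAN */
	PORT_TAG_INVALID,	/* several untagged VLANs next to tagged ones */
};

/* Pick the egress tagging mode from the port's current VLAN state. */
enum port_tag_mode pick_port_tag_mode(bool vlan_aware, int num_tagged,
				      int num_untagged)
{
	if (!vlan_aware)
		return PORT_TAG_DISABLED;	/* classified VLAN is internal only */
	if (num_tagged == 0)
		return PORT_TAG_DISABLED;	/* pure egress-untagged port */
	if (num_untagged == 0)
		return PORT_TAG_TRUNK;		/* pure trunk */
	if (num_untagged == 1)
		return PORT_TAG_NATIVE;		/* trunk port + native VLAN */
	return PORT_TAG_INVALID;		/* expected to be rejected earlier */
}
```

Because the mode is recomputed from the current state each time, no history of add/delete operations has to be tracked, which is the declarative approach the commit message describes.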

net: mscc: ocelot: convert the VLAN masks to a list · 90e0aa8d

Committed by Vladimir Oltean on Oct 20, 2021

First and foremost, the driver currently allocates a constant sized
4K * u32 (16KB memory) array for the VLAN masks. However, a typical
application might not need so many VLANs, so if we dynamically allocate
the memory as needed, we might actually save some space.

Secondly, we'll need to keep more advanced bookkeeping of the VLANs we
have, notably we'll have to check how many untagged and how many tagged
VLANs we have. This will have to stay in a structure, and allocating
another 16 KB array for that is again a bit too much.

So refactor the bridge VLANs in a linked list of structures.

The hook points inside the driver are ocelot_vlan_member_add() and
ocelot_vlan_member_del(), which previously used to operate on the
ocelot->vlan_mask[vid] array element.

ocelot_vlan_member_add() and ocelot_vlan_member_del() used to call
ocelot_vlan_member_set() to commit to the ocelot->vlan_mask.
Additionally, we had two calls to ocelot_vlan_member_set() from outside
those callers, and those were directly from ocelot_vlan_init().
Those calls do not set up bridging service VLANs, instead they:

- clear the VLAN table on reset
- set the port pvid to the value used by this driver for VLAN-unaware
  standalone port operation (VID 0)

So now, when we have a structure which represents actual bridge VLANs,
VID 0 doesn't belong in that structure, since it is not part of the
bridging layer.

So delete the middle man, ocelot_vlan_member_set(), and let
ocelot_vlan_init() call directly ocelot_vlant_set_mask() which forgoes
any data structure and writes directly to hardware, which is all that we
need.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
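
As a rough illustration of replacing a fixed 4K-entry mask array with a list that only holds the VLANs in use, here is a small userspace sketch. The struct and function names are hypothetical and the list is a plain singly linked list; the driver itself uses kernel list primitives and its own structures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bridge VLAN entry: one node per VID that has members. */
struct bridge_vlan {
	uint16_t vid;
	uint32_t portmask;	/* bit N set means port N is a member */
	struct bridge_vlan *next;
};

struct bridge_vlan *vlan_find(struct bridge_vlan *head, uint16_t vid)
{
	for (; head; head = head->next)
		if (head->vid == vid)
			return head;
	return NULL;
}

/* Add a port to a VLAN, allocating the entry on first use. */
int vlan_member_add(struct bridge_vlan **head, int port, uint16_t vid)
{
	struct bridge_vlan *v = vlan_find(*head, vid);

	if (!v) {
		v = calloc(1, sizeof(*v));
		if (!v)
			return -1;
		v->vid = vid;
		v->next = *head;
		*head = v;
	}
	v->portmask |= UINT32_C(1) << port;
	return 0;
}

/* Remove a port from a VLAN, freeing the entry once it has no members. */
int vlan_member_del(struct bridge_vlan **head, int port, uint16_t vid)
{
	struct bridge_vlan **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct bridge_vlan *v = *pp;

		if (v->vid != vid)
			continue;
		v->portmask &= ~(UINT32_C(1) << port);
		if (!v->portmask) {
			*pp = v->next;
			free(v);
		}
		return 0;
	}
	return -1;	/* VID not present */
}
```

Memory use now scales with the number of configured VLANs, and each node has room for the extra bookkeeping (such as an untagged mask) that the commit message anticipates.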

net: mscc: ocelot: add a type definition for REW_TAG_CFG_TAG_CFG · 62a22bcb

Committed by Vladimir Oltean on Oct 20, 2021

This is a cosmetic patch which clarifies what are the port tagging
options for Ocelot switches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fq_codel: generalise ce_threshold marking for subset of traffic · dfcb63ce

Committed by Toke Høiland-Jørgensen on Oct 19, 2021

Commit e72aeb9e ("fq_codel: implement L4S style ce_threshold_ect1
marking") expanded the ce_threshold feature of FQ-CoDel so it can
be applied to a subset of the traffic, using the ECT(1) bit of the ECN
field as the classifier. However, hard-coding ECT(1) as the only
classifier for this feature seems limiting, so let's expand it to be more
general.

To this end, change the parameter from a ce_threshold_ect1 boolean, to a
one-byte selector/mask pair (ce_threshold_{selector,mask}) which is applied
to the whole diffserv/ECN field in the IP header. This makes it possible to
classify packets by any value in either the ECN field or the diffserv
field. In particular, setting a selector of INET_ECN_ECT_1 and a mask of
INET_ECN_MASK corresponds to the functionality before this patch, and a
mask of ~INET_ECN_MASK allows using the selector as a straight-forward
match against a diffserv code point:

 # apply ce_threshold to ECT(1) traffic
 tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x1/0x3

 # apply ce_threshold to ECN-capable traffic marked as diffserv AF22
 tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x50/0xfc

Regardless of the selector chosen, the normal rules for ECN-marking of
packets still apply, i.e., the flow must still declare itself ECN-capable
by setting one of the bits in the ECN field to get marked at all.

v2:
- Add tc usage examples to patch description
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211019174709.69081-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
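
The selector/mask pair described above reduces to a single masked comparison on the diffserv/ECN byte. A minimal sketch follows, with a hypothetical function name and locally defined constants that mirror the ECN values named in the commit message:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the ECN constants named in the commit message. */
#define ECN_ECT_1	0x01
#define ECN_MASK	0x03

/*
 * Does ce_threshold apply to a packet with this diffserv/ECN byte?
 * Hypothetical helper; the qdisc performs an equivalent masked compare.
 */
bool ce_threshold_applies(uint8_t dsfield, uint8_t selector, uint8_t mask)
{
	return (dsfield & mask) == selector;
}
```

With selector 0x1 and mask 0x3 only the two ECN bits are examined, so only ECT(1) packets match. With selector 0x50 and mask 0xfc only the six diffserv bits are examined, so any AF22 packet matches regardless of its ECN bits.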

20 Oct, 2021 2 commits

net: sched: remove one pair of atomic operations · 97604c65

Committed by Eric Dumazet on Oct 18, 2021

__QDISC_STATE_RUNNING is only set/cleared from contexts owning qdisc lock.

Thus we can use less expensive bit operations, as we were doing
before commit f9eb8aea ("net_sched: transform qdisc running bit into a seqcount")

Fixes: 29cbcd85 ("net: sched: Remove Qdisc::running sequence counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ahmed S. Darwish <a.darwish@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: fix logic error in qdisc_run_begin() · 4c57e2fa

Committed by Eric Dumazet on Oct 18, 2021

For non TCQ_F_NOLOCK qdisc, qdisc_run_begin() tries to set
__QDISC_STATE_RUNNING and should return true if the bit was not set.

test_and_set_bit() returns old bit value, therefore we need to invert.

Fixes: 29cbcd85 ("net: sched: Remove Qdisc::running sequence counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ahmed S. Darwish <a.darwish@linutronix.de>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
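
The inversion is easy to get wrong, so here is a userspace sketch of the same pattern using C11 atomics in place of the kernel's test_and_set_bit(). The names run_begin(), run_end() and STATE_RUNNING are hypothetical; atomic_fetch_or() plays the role of the test-and-set and, like it, returns the old value.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_RUNNING	0x1UL

/*
 * Try to become the runner. The fetch-or returns the OLD state, so the
 * caller wins only when the bit was previously clear, hence the "!".
 */
bool run_begin(atomic_ulong *state)
{
	unsigned long old = atomic_fetch_or(state, STATE_RUNNING);

	return !(old & STATE_RUNNING);
}

/* Release the running flag. */
void run_end(atomic_ulong *state)
{
	atomic_fetch_and(state, ~STATE_RUNNING);
}
```

Without the negation, the first caller would be told it failed to start and every later caller would be told it succeeded, which is the logic error this commit fixes.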

19 Oct, 2021 5 commits

ethernet: add a helper for assigning port addresses · e80094a4

Committed by Jakub Kicinski on Oct 18, 2021

We have 5 drivers which offset base MAC addr by port id.
Create a helper for them.

This helper takes care of overflows, which some drivers
did not do, please complain if that's going to break
anything!
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
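
To show what "taking care of overflows" means when offsetting a base MAC address by a port id, here is a small sketch. The function name mac_addr_add() is hypothetical and this is not the kernel helper itself; it treats the address as a 48-bit big-endian integer so that a carry out of the last byte propagates into the bytes before it.

```c
#include <stdint.h>

/* Write base + id into dst, carrying across byte boundaries. */
void mac_addr_add(uint8_t dst[6], const uint8_t base[6], unsigned int id)
{
	uint64_t v = 0;
	int i;

	/* Pack the six bytes into one integer, most significant first. */
	for (i = 0; i < 6; i++)
		v = (v << 8) | base[i];

	v += id;

	/* Unpack the low 48 bits back into the address. */
	for (i = 5; i >= 0; i--) {
		dst[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

A driver that only adds the port id to the last byte would turn 00:11:22:33:44:ff plus 1 into 00:11:22:33:44:00, whereas the carry-aware version yields 00:11:22:33:45:00.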

net: sch_tbf: Add a graft command · 6b3efbfa

Committed by Petr Machata on Oct 19, 2021

As another qdisc is linked to the TBF, the latter should issue an event to
give drivers a chance to react to the grafting. In other qdiscs, this event
is called GRAFT, so follow suit with TBF as well.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Introduce new uplink destination type · 58a606db

Committed by Maor Gottlieb on Aug 03, 2021

The uplink destination type should be used in rules to steer the
packet to the uplink when the device is in steering based LAG mode.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add support to create match definer · e7e2519e

Committed by Maor Gottlieb on Jul 06, 2021

Introduce new APIs to create and destroy flow matcher
for given format id.

Flow match definer object is used for defining the fields and
mask used for the hash calculation. User should mask the desired
fields like done in the match criteria.

This object is assigned to flow group of type hash. In this flow
group type, packets lookup is done based on the hash result.

This patch also adds the required bits to create such flow group.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Introduce port selection namespace · 425a563a

Committed by Maor Gottlieb on May 23, 2021

Add new port selection flow steering namespace. Flow steering rules in
this namespace are used to determine the physical port for egress
packets.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

18 Oct, 2021 10 commits

net: dsa: tag_rtl8_4: add realtek 8 byte protocol 4 tag · 1521d5ad

Committed by Alvin Šipraga on Oct 18, 2021

This commit implements a basic version of the 8 byte tag protocol used
in the Realtek RTL8365MB-VC unmanaged switch, which carries with it a
protocol version of 0x04.

The implementation itself only handles the parsing of the EtherType
value and Realtek protocol version, together with the source or
destination port fields. The rest is left unimplemented for now.

The tag format is described in a confidential document provided to my
company by Realtek Semiconductor Corp. Permission has been granted by
the vendor to publish this driver based on that material, together with
an extract from the document describing the tag format and its fields.
It is hoped that this will help future implementors who do not have
access to the material but who wish to extend the functionality of
drivers for chips which use this protocol.

In addition, two possible values of the REASON field are specified,
based on experiments on my end. Realtek does not specify what value this
field can take.
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: allow reporting of standard ethtool stats for slave devices · 487d3855

Committed by Alvin Šipraga on Oct 18, 2021

Jakub pointed out that we have a new ethtool API for reporting device
statistics in a standardized way, via .get_eth_{phy,mac,ctrl}_stats.
Add a small amount of plumbing to allow DSA drivers to take advantage of
this when exposing statistics.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>

ether: add EtherType for proprietary Realtek protocols · 7bbbbfaa

Committed by Alvin Šipraga on Oct 18, 2021

Add a new EtherType ETH_P_REALTEK to the if_ether.h uapi header. The
EtherType 0x8899 is used in a number of different protocols from Realtek
Semiconductor Corp [1], so no general assumptions should be made when
trying to decode such packets. Observed protocols include:

  0x1 - Realtek Remote Control protocol [2]
  0x2 - Echo protocol [2]
  0x3 - Loop detection protocol [2]
  0x4 - RTL8365MB 4- and 8-byte switch CPU tag protocols [3]
  0x9 - RTL8306 switch CPU tag protocol [4]
  0xA - RTL8366RB switch CPU tag protocol [4]

[1] https://lore.kernel.org/netdev/CACRpkdYQthFgjwVzHyK3DeYUOdcYyWmdjDPG=Rf9B3VrJ12Rzg@mail.gmail.com/
[2] https://www.wireshark.org/lists/ethereal-dev/200409/msg00090.html
[3] https://lore.kernel.org/netdev/20210822193145.1312668-4-alvin@pqrs.dk/
[4] https://lore.kernel.org/netdev/20200708122537.1341307-2-linus.walleij@linaro.org/

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
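
Since the EtherType alone does not identify the protocol, a decoder has to look at the Realtek protocol value before interpreting anything else. The sketch below only encodes the table from the commit message; the function name is hypothetical, and how the protocol value is extracted from a frame is deliberately left out because the commit message does not describe that layout.

```c
#include <stdint.h>

/* Map an observed Realtek protocol value to the names listed above. */
const char *realtek_proto_name(uint8_t proto)
{
	switch (proto) {
	case 0x1:
		return "Realtek Remote Control protocol";
	case 0x2:
		return "Echo protocol";
	case 0x3:
		return "Loop detection protocol";
	case 0x4:
		return "RTL8365MB 4- and 8-byte switch CPU tag protocols";
	case 0x9:
		return "RTL8306 switch CPU tag protocol";
	case 0xA:
		return "RTL8366RB switch CPU tag protocol";
	default:
		return "unknown";	/* the list is "observed", not exhaustive */
	}
}
```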

net: sched: Remove Qdisc::running sequence counter · 29cbcd85

Committed by Ahmed S. Darwish on Oct 16, 2021

The Qdisc::running sequence counter has two uses:

  1. Reliably reading qdisc's tc statistics while the qdisc is running
     (a seqcount read/retry loop at gnet_stats_add_basic()).

  2. As a flag, indicating whether the qdisc in question is running
     (without any retry loops).

For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.

Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().

For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.

Now that the qdisc->running sequence counter is no longer used, remove
it.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types · 50dc9a85

Committed by Ahmed S. Darwish on Oct 16, 2021

The only factor differentiating per-CPU bstats data type (struct
gnet_stats_basic_cpu) from the packed non-per-CPU one (struct
gnet_stats_basic_packed) was a u64_stats sync point inside the former.
The two data types are now equivalent: earlier commits added a u64_stats
sync point to the latter.

Combine both data types into "struct gnet_stats_basic_sync". This
eliminates redundancy and simplifies the bstats read/write APIs.

Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit
architectures, u64_stats sync points do not use sequence counter
protection.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: Protect Qdisc::bstats with u64_stats · 67c9e627

Committed by Ahmed S. Darwish on Oct 16, 2021

The not-per-CPU variant of qdisc tc (traffic control) statistics,
Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running
sequence counter.

This sequence counter is used for reliably protecting bstats reads from
parallel writes. Meanwhile, the seqcount's write section covers a much
wider area than bstats update: qdisc_run_begin() => qdisc_run_end().

That read/write section asymmetry can lead to needless retries of the
read section. To prepare for removing the Qdisc::running sequence
counter altogether, introduce a u64_stats sync point inside bstats
instead.

Modify _bstats_update() to start/end the bstats u64_stats write
section.

For bisectability, and finer commits granularity, the bstats read
section is still protected with a Qdisc::running read/retry loop and
qdisc_run_begin/end() still starts/ends that seqcount write section.
Once all call sites are modified to use _bstats_update(), the
Qdisc::running seqcount will be removed and bstats read/retry loop will
be modified to utilize the internal u64_stats sync point.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the statistics "packets" vs. "bytes"
values getting out of sync on rare occasions. The individual values will
still be valid.

[bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
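
For readers unfamiliar with the read/retry pattern, here is a userspace sketch of a sync point embedded in the statistics structure, with a write section that covers only the counter update. It uses C11 atomics and hypothetical names; it is not the kernel's u64_stats implementation, which, as the commit message notes, uses no sequence counter at all on 64-bit architectures.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stats block with its own sequence counter. */
struct basic_stats {
	atomic_uint seq;		/* odd while a write is in progress */
	_Atomic uint64_t bytes;
	_Atomic uint64_t packets;
};

/* Writer: the write section is exactly as wide as the update itself. */
void stats_update(struct basic_stats *s, uint64_t bytes, uint64_t packets)
{
	atomic_fetch_add(&s->seq, 1);	/* begin: seq becomes odd */
	s->bytes += bytes;
	s->packets += packets;
	atomic_fetch_add(&s->seq, 1);	/* end: seq is even again */
}

/* Reader: retry until both values come from the same quiescent period. */
void stats_read(struct basic_stats *s, uint64_t *bytes, uint64_t *packets)
{
	unsigned int start;

	do {
		start = atomic_load(&s->seq);
		*bytes = s->bytes;
		*packets = s->packets;
	} while ((start & 1) || atomic_load(&s->seq) != start);
}
```

The narrower the write section, the less often a reader observes an odd or changed sequence value, which is why moving the sync point next to the counters reduces needless retries.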

u64_stats: Introduce u64_stats_set() · f2efdb17

Committed by Ahmed S. Darwish on Oct 16, 2021

Allow directly setting a u64_stats_t value. This is used to provide an init
function which sets it directly to zero instead of using memset() on the value.

Add u64_stats_set() to the u64_stats API.

[bigeasy: commit message. ]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

gen_stats: Move remaining users to gnet_stats_add_queue(). · 10940eb7

Committed by Sebastian Andrzej Siewior on Oct 16, 2021

The gnet_stats_queue::qlen member is only used in the SMP-case.

qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to
have the same value as that provided by qdisc_qlen_sum().

gnet_stats_copy_queue() needs to overwrite the resulting qstats.qlen
field with the caller-submitted qlen value. It might differ from the
submitted value.

Let both functions use gnet_stats_add_queue() and remove unused
__gnet_stats_copy_queue().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

gen_stats: Add gnet_stats_add_queue(). · 448e163f

Committed by Sebastian Andrzej Siewior on Oct 16, 2021

This function will replace __gnet_stats_copy_queue(). It reads all
arguments and adds them into the passed gnet_stats_queue argument.
In contrast to __gnet_stats_copy_queue() it also copies the qlen member.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

gen_stats: Add instead Set the value in __gnet_stats_copy_basic(). · fbf307c8

Committed by Sebastian Andrzej Siewior on Oct 16, 2021

__gnet_stats_copy_basic() always assigns the value to the bstats
argument overwriting the previous value. The later added per-CPU version
always accumulated the values in the returning gnet_stats_basic_packed
argument.

Based on review there are five users of that function as of today:
- est_fetch_counters(), ___gnet_stats_copy_basic()
  memsets() bstats to zero, single invocation.

- mq_dump(), mqprio_dump(), mqprio_dump_class_stats()
  memsets() bstats to zero, multiple invocation but does not use the
  function due to !qdisc_is_percpu_stats().

Add the values in __gnet_stats_copy_basic() instead of overwriting. Rename
the function to gnet_stats_add_basic() to make it more obvious.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
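
The difference between "set" and "add" semantics matters once a caller sums several sources into one result. A minimal sketch with hypothetical names, showing the accumulate form that the rename to gnet_stats_add_basic() is meant to make obvious:

```c
#include <stdint.h>

/* Hypothetical stand-in for a basic bytes/packets counter pair. */
struct basic_counters {
	uint64_t bytes;
	uint64_t packets;
};

/* Accumulate src into sum; the caller zeroes sum before the first call. */
void counters_add(struct basic_counters *sum, const struct basic_counters *src)
{
	sum->bytes += src->bytes;
	sum->packets += src->packets;
}
```

A caller that zeroes the destination and invokes the helper once gets the same result as with the old assign form, which matches the commit's observation that single-invocation users already clear bstats first.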

16 Oct, 2021 6 commits

net/smc: add netlink support for SMC-Rv2 · b0539f5e

Committed by Karsten Graul on Oct 16, 2021

Implement the netlink support for SMC-Rv2 related attributes that are
provided to user space.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Use native_port_num as 1st option of device index · 1021d064

Committed by Rongwei Liu on Oct 12, 2021

Using "native_port_num" can support more NICs.

Fallback to PCIe IDs if "native_port_num" query fails.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Introduce new device index wrapper · 2ec16ddd

Committed by Rongwei Liu on Sep 16, 2021

Downstream patches.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Disable roce at HCA level · fbfa97b4

Committed by Shay Drory on Aug 18, 2021

Currently, when a user disables roce via the devlink param, this change
isn't passed down to the device.
If device allows disabling RoCE at device level, make use of it. This
instructs the device to skip memory allocations related to RoCE
functionality which otherwise is done by the device.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Read timeout values from init segment · 5945e1ad

Committed by Amir Tzin on Oct 07, 2021

Replace hard coded timeouts with values stored in firmware's init
segment. Timeouts are read from init segment during driver load. If init
segment timeouts are not supported then fallback to hard coded defaults
instead. Also move pre initialization timeouts which cannot be read from
firmware to the new mechanism.
Signed-off-by: Amir Tzin <amirtz@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Add layout to support default timeouts register · 4b2c5fa9

Committed by Amir Tzin on Jul 21, 2021

Add needed structures and defines for DTOR (default timeouts register).
This will be used to get timeouts values from FW instead of hard coded
values in the driver code thus enabling support for slower devices which
need longer timeouts.
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

15 Oct, 2021 13 commits

soc: fsl: dpio: add Net DIM integration · 69651bd8

Committed by Ioana Ciornei on Oct 15, 2021

Use the generic dynamic interrupt moderation (dim) framework to
implement adaptive interrupt coalescing on Rx. With the per-packet
interrupt scheme, a high interrupt rate has been noted for moderate
traffic flows leading to high CPU utilization.

The dpio driver exports new functions to enable/disable adaptive IRQ
coalescing on a DPIO object, to query the state or to update Net DIM
with a new set of bytes and frames dequeued.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

soc: fsl: dpio: add support for irq coalescing per software portal · ed1d2143

Committed by Ioana Ciornei on Oct 15, 2021

In DPAA2 based SoCs, the IRQ coalescing support per software portal has 2
configurable parameters:
 - the IRQ timeout period (QBMAN_CINH_SWP_ITPR): how many 256 QBMAN
   cycles need to pass until a dequeue interrupt is asserted.
 - the IRQ threshold (QBMAN_CINH_SWP_DQRR_ITR): how many dequeue
   responses in the DQRR ring would generate an IRQ.

Add support for setting up and querying these IRQ coalescing related
parameters.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

soc: fsl: dpio: extract the QBMAN clock frequency from the attributes · 2cf0b6fe

Committed by Ioana Ciornei on Oct 15, 2021

Through the dpio_get_attributes() firmware call the dpio driver has
access to the QBMAN clock frequency. Extend the structure which holds
the firmware's response so that we can have access to this information.

This will be needed in the next patches which also add support for
interrupt coalescing which needs to be configured based on the
frequency.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fq_codel: implement L4S style ce_threshold_ect1 marking · e72aeb9e

Committed by Eric Dumazet on Oct 14, 2021

Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.

If enabled, only packets with ECT(1) can be transformed to CE
if their sojourn time is above the ce_threshold.

Note that this new option does not change rules for codel law.
In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
the default when fq_codel qdisc is created), ECT(0) packets can
still get CE if codel law (as governed by limit/target) decides so.

Section 4.3.b of current draft [1] states:

b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
    be used for L4S.  For instance within each queue of an FQ-CoDel
    system, as well as a CoDel AQM, there is typically also ECN
    marking at an immediate (unsmoothed) shallow threshold to support
    use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
    modified so that the shallow threshold is solely applied to
    ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
    packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
    applied; while if there is a flow of ECT(1) packets in the queue,
    the shallower (typically sub-millisecond) threshold is applied.

Tested:

tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec

netperf ... -t TCP_STREAM -- -K dctcp

tc -s -d qd sh dev eth1
qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
 backlog 0b 0p requeues 152013
  maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
  new_flows_len 0 old_flows_len 0

[1] L4S current draft:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Cc: Bob Briscoe <in@bobbriscoe.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add skb_get_dsfield() helper · 70e939dd

Committed by Eric Dumazet on Oct 14, 2021

skb_get_dsfield(skb) gets dsfield from skb, or -1
if an error was found.

This is basically a wrapper around ipv4_get_dsfield()
and ipv6_get_dsfield().

Used by following patch for fq_codel.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
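
To make the "dsfield or -1" contract concrete, here is a userspace sketch that extracts the diffserv/ECN byte from a raw IPv4 or IPv6 header. The function name and the choice to dispatch on the version nibble are assumptions made for this example; the kernel helper works on an skb and delegates to ipv4_get_dsfield() and ipv6_get_dsfield().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the 8-bit diffserv/ECN field of the L3 header at l3,
 * or -1 if the header is too short or not IPv4/IPv6.
 */
int get_dsfield(const uint8_t *l3, size_t len)
{
	if (len < 2)
		return -1;

	switch (l3[0] >> 4) {
	case 4:
		/* IPv4: the former TOS byte is the second header byte. */
		return l3[1];
	case 6:
		/* IPv6: traffic class straddles the first two bytes. */
		return ((l3[0] & 0x0f) << 4) | (l3[1] >> 4);
	default:
		return -1;
	}
}
```

Returning an int rather than a uint8_t is what lets the caller distinguish a valid value of 0 from the error case.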

tcp: switch orphan_count to bare per-cpu counters · 19757ceb

Committed by Eric Dumazet on Oct 14, 2021

Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
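
The idea of plain per-CPU counters plus a periodically refreshed cache can be sketched in a few lines. Everything below is a single-threaded illustration with hypothetical names and a fixed CPU count; the kernel version uses real per-CPU variables and a timer to drive the refresh.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS	4	/* arbitrary value for the example */

struct orphan_counter {
	int percpu[SKETCH_NR_CPUS];	/* each CPU touches only its own slot */
	int cached_sum;			/* refreshed periodically, may be stale */
};

/* Fast path: no shared lock, no shared cache line. */
void orphan_inc(struct orphan_counter *c, int cpu) { c->percpu[cpu]++; }
void orphan_dec(struct orphan_counter *c, int cpu) { c->percpu[cpu]--; }

/* Slow path, run from a periodic timer: fold all slots into the cache. */
void orphan_refresh(struct orphan_counter *c)
{
	int cpu, sum = 0;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	c->cached_sum = sum > 0 ? sum : 0;
}

/* The limit check reads only the cached value, never the per-CPU slots. */
bool too_many_orphans(const struct orphan_counter *c, int max_orphans)
{
	return c->cached_sum > max_orphans;
}
```

The trade-off is that the check can lag reality by up to one refresh period, which the commit message argues is acceptable because TCP pressure state does not change radically within 100ms.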

page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA · d00e60ee

Committed by Yunsheng Lin on Oct 13, 2021

32-bit architectures with 64-bit DMA seem to be rare these days,
and page pool might carry a lot of code and complexity for
such systems.

So disable DMA mapping support for such systems. If drivers
really want to work on such systems, they have to implement
their own DMA-mapping fallback tracking outside page_pool.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: of: fix stub of_net helpers for CONFIG_NET=n · 8b017fbe

Committed by Arnd Bergmann on Oct 14, 2021

Moving the of_net code from drivers/of/ to net/core means we
no longer stub out the helpers when networking is disabled,
which leads to a randconfig build failure with at least one
ARM platform that calls this from non-networking code:

arm-linux-gnueabi-ld: arch/arm/mach-mvebu/kirkwood.o: in function `kirkwood_dt_eth_fixup':
kirkwood.c:(.init.text+0x54): undefined reference to `of_get_mac_address'

Restore the way this worked before by changing that #ifdef
check back to testing for both CONFIG_OF and CONFIG_NET.

Fixes: e330fb14 ("of: net: move of_net under net/")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211014090055.2058949-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: ebtables: allow use of ebt_do_table as hookfn · f0d6764f

Committed by Florian Westphal on Oct 11, 2021

This is possible now that the xt_table structure is passed via *priv.
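
To illustrate the idea, the sketch below models it in plain C: once the
table travels in the opaque priv pointer, the table evaluator has the
same shape as a hook function and can be registered directly, without a
per-table wrapper. All demo_* names and types are hypothetical
simplifications, not the kernel's definitions.

```c
#include <stddef.h>

#define DEMO_DROP   0u
#define DEMO_ACCEPT 1u

struct demo_skb   { unsigned int len; };
struct demo_state { int hook; };
struct demo_table { const char *name; unsigned int max_len; };

/* Shape of a hook function: opaque priv, packet, hook state. */
typedef unsigned int demo_hookfn(void *priv, struct demo_skb *skb,
				 const struct demo_state *state);

/* Table evaluator with the hook signature; the table arrives via priv. */
static unsigned int demo_do_table(void *priv, struct demo_skb *skb,
				  const struct demo_state *state)
{
	const struct demo_table *table = priv;

	(void)state;
	return skb->len <= table->max_len ? DEMO_ACCEPT : DEMO_DROP;
}

struct demo_hook_ops { demo_hookfn *hook; void *priv; };

/* Core dispatch: calls whatever was registered, passing priv through. */
static unsigned int demo_run_hook(const struct demo_hook_ops *ops,
				  struct demo_skb *skb)
{
	struct demo_state state = { .hook = 0 };

	return ops->hook(ops->priv, skb, &state);
}
```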

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

f0d6764f

netfilter: ip6tables: allow use of ip6t_do_table as hookfn · 44b5990e

Committed by Florian Westphal on Oct 11, 2021

This is possible now that the xt_table structure is passed via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

44b5990e

netfilter: arp_tables: allow use of arpt_do_table as hookfn · e8d225b6

Committed by Florian Westphal on Oct 11, 2021

This is possible now that the xt_table structure is passed in via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

e8d225b6

netfilter: iptables: allow use of ipt_do_table as hookfn · 8844e010

Committed by Florian Westphal on Oct 11, 2021

This is possible now that the xt_table structure is passed in via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

8844e010

netfilter: Introduce egress hook · 42df6e1d

Committed by Lukas Wunner on Oct 08, 2021

Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
  on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
  such as local ARP traffic generated for clustering purposes or DHCP
  (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
  and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)

The egress hook introduced herein complements the ingress hook added by
commit e687ad60 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key").  A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.

Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc).  On ingress, packets are classified first by
tc, then by netfilter.  On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.

Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

In this case, netfilter egress classifying is not performed when leaving
the host namespace!  That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying.  That is only logical since it hasn't
passed through netfilter ingress classifying either.

Packets can alternatively be redirected at the netfilter layer using
nft fwd.  Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.

Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress().  This ensures that
netfilter is applied both on the overlay and underlying network.
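
The flag handling can be pictured with a small user-space toy model.
It is only a sketch under stated assumptions: struct demo_pkt,
demo_tc_egress() and demo_queue_xmit() are hypothetical stand-ins for
the skb, sch_handle_egress() and __dev_queue_xmit(), and the model
counts netfilter egress runs instead of evaluating rules.

```c
#include <stdbool.h>

/* Toy packet: the skip flag plus a counter of netfilter egress runs. */
struct demo_pkt {
	bool nf_skip_egress;
	int  nf_egress_runs;
	int  tunnel_layers;	/* encapsulations still to perform */
	bool tc_redirect;	/* tc redirects to another device once */
};

static void demo_queue_xmit(struct demo_pkt *pkt);

/* Stand-in for the tc egress step; returns true if it took the packet. */
static bool demo_tc_egress(struct demo_pkt *pkt)
{
	if (!pkt->tc_redirect)
		return false;
	pkt->tc_redirect = false;
	demo_queue_xmit(pkt);	/* flag still set, so netfilter is skipped */
	return true;
}

static void demo_queue_xmit(struct demo_pkt *pkt)
{
	if (!pkt->nf_skip_egress)
		pkt->nf_egress_runs++;	/* netfilter egress classifying */

	pkt->nf_skip_egress = true;
	if (demo_tc_egress(pkt))
		return;			/* packet was redirected by tc */
	pkt->nf_skip_egress = false;	/* reverted after the tc step */

	if (pkt->tunnel_layers > 0) {
		pkt->tunnel_layers--;
		demo_queue_xmit(pkt);	/* tunnel driver re-enters transmit */
	}
}
```

In this model a packet with one tunnel layer is classified twice (once
on the overlay, once on the underlay), while a packet redirected by tc
is classified only on the device it was first sent to.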

Interaction between tc and netfilter is possible by setting and querying
skb->mark.

If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:

Before:             1537 1538 1538 1537 1538 1537 Mb/sec
After:              1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec

When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured.  That is caused by checking dev->nf_hooks_egress
against NULL.
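
The two cost levels described above can be sketched as follows. This is
an illustrative model only: demo_egress_enabled is a plain boolean
standing in for the static_key (which in the kernel patches the branch
out of the instruction stream rather than testing a variable), and the
demo_* types are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_hooks { int num_rules; };
struct demo_dev   { const struct demo_hooks *nf_hooks_egress; };

/* Stand-in for the static key: set once any interface uses the hook. */
static bool demo_egress_enabled;

/* Returns true only when this device's egress rules would be run. */
static bool demo_nf_egress(const struct demo_dev *dev)
{
	if (!demo_egress_enabled)
		return false;	/* hook unused everywhere: path patched out */
	if (dev->nf_hooks_egress == NULL)
		return false;	/* hook used elsewhere: one pointer check */
	return true;
}
```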

Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Accept all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Drop all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

42df6e1d

openeuler / Kernel: synced successfully nearly 2 years ago