1. 12 8月, 2021 2 次提交
    • V
      net: dsa: tag_8021q: don't broadcast during setup/teardown · 724395f4
      Vladimir Oltean 提交于
      Currently, on my board with multiple sja1105 switches in disjoint trees
      described in commit f66a6a69 ("net: dsa: permit cross-chip bridging
      between all trees in the system"), rebooting the board triggers the
      following benign warnings:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      
      Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
      RX VLANs cannot be found on switch 2's CPU port.
      
      But why would switch 2 even attempt to delete switch 1's TX and RX
      tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
      and it is supposed that it had added those VLANs in the first place
      (because in dsa_port_tag_8021q_vlan_match, all CPU ports match
      regardless of their tree index or switch index).
      
      The two trees probe asynchronously, and when switch 1 probed, it called
      dsa_broadcast which did not notify the tree of switch 2, because that
      didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
      _is_ notified of the deletion.
      
      Before jumping to introduce a synchronization mechanism between the
      probing across disjoint switch trees, let's take a step back and see
      whether we _need_ to do that in the first place.
      
      The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
      only if switch 1 and 2 were part of a cross-chip bridge. And
      dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
      was synchronous, the bridge_join would just end up bumping the VLANs'
      refcount, because they are already installed by the setup path).
      
      Since by the time the ports are bridged, all DSA trees are already set
      up, and we don't need the tag_8021q VLANs of one switch installed on the
      other switches during probe time, the answer is that we don't need to
      fix the synchronization issue.
      
      So make the setup and teardown code paths call dsa_port_notify, which
      notifies only the local tree, and the bridge code paths call
      dsa_broadcast, which let the other trees know as well.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      724395f4
    • V
      net: dsa: print more information when a cross-chip notifier fails · ab97462b
      Vladimir Oltean 提交于
      Currently this error message does not say a lot:
      
      [   32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      
      but in this form, it is immediately obvious (at least to me) what the
      problem is, even without further looking at the code:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab97462b
  2. 11 8月, 2021 4 次提交
  3. 09 8月, 2021 6 次提交
    • L
      devlink: Set device as early as possible · 919d13a7
      Leon Romanovsky 提交于
      All kernel devlink implementations call to devlink_alloc() during
      initialization routine for specific device which is used later as
      a parent device for devlink_register().
      
      Such late device assignment causes to the situation which requires us to
      call to device_register() before setting other parameters, but that call
      opens devlink to the world and makes accessible for the netlink users.
      
      Any attempt to move devlink_register() to be the last call generates the
      following error due to access to the devlink->dev pointer.
      
      [    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
      [    8.760305]  devlink_param_notify+0x6d/0x180
      [    8.760435]  __devlink_params_register+0x2f1/0x670
      [    8.760558]  devlink_params_register+0x1e/0x20
      
      The simple change of API to set devlink device in the devlink_alloc()
      instead of devlink_register() fixes all this above and ensures that
      prior to call to devlink_register() everything already set.
      Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      919d13a7
    • V
      net: dsa: avoid fast ageing twice when port leaves a bridge · bee7c577
      Vladimir Oltean 提交于
      Drivers that support both the toggling of address learning and dynamic
      FDB flushing (mv88e6xxx, b53, sja1105) currently need to fast-age a port
      twice when it leaves a bridge:
      
      - once, when del_nbp() calls br_stp_disable_port() which puts the port
        in the BLOCKING state
      - twice, when dsa_port_switchdev_unsync_attrs() calls
        dsa_port_clear_brport_flags() which disables address learning
      
      The knee-jerk reaction might be to say "dsa_port_clear_brport_flags does
      not need to fast-age the port at all", but the thing is, we still need
      both code paths to flush the dynamic FDB entries in different situations.
      When a DSA switch port leaves a bonding/team interface that is (still) a
      bridge port, no del_nbp() will be called, so we rely on
      dsa_port_clear_brport_flags() function to restore proper standalone port
      functionality with address learning disabled.
      
      So the solution is just to avoid double the work when both code paths
      are called in series. Luckily, DSA already caches the STP port state, so
      we can skip flushing the dynamic FDB when we disable address learning
      and the STP state is one where no address learning takes place at all.
      Under that condition, not flushing the FDB is safe because there is
      supposed to not be any dynamic FDB entry at all (they were flushed
      during the transition towards that state, and none were learned in the
      meanwhile).
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bee7c577
    • V
      net: dsa: still fast-age ports joining a bridge if they can't configure learning · a4ffe09f
      Vladimir Oltean 提交于
      Commit 39f32101 ("net: dsa: don't fast age standalone ports")
      assumed that all standalone ports disable address learning, but if the
      switch driver implements .port_fast_age but not .port_bridge_flags (like
      ksz9477, ksz8795, lantiq_gswip, lan9303), then that might not actually
      be true.
      
      So whereas before, the bridge temporarily walking us through the
      BLOCKING STP state meant that the standalone ports had a checkpoint to
      flush their baggage and start fresh when they join a bridge, after that
      commit they no longer do.
      
      Restore the old behavior for these drivers by checking if the switch can
      toggle address learning. If it can't, disregard the "do_fast_age"
      argument and unconditionally perform fast ageing on STP state changes.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4ffe09f
    • V
      net: dsa: flush the dynamic FDB of the software bridge when fast ageing a port · 9264e4ad
      Vladimir Oltean 提交于
      Currently, when DSA performs fast ageing on a port, 'bridge fdb' shows
      us that the 'self' entries (corresponding to the hardware bridge, as
      printed by dsa_slave_fdb_dump) are deleted, but the 'master' entries
      (corresponding to the software bridge) aren't.
      
      Indeed, searching through the bridge driver, neither the
      brport_attr_learning handler nor the IFLA_BRPORT_LEARNING handler call
      br_fdb_delete_by_port. However, br_stp_disable_port does, which is one
      of the paths which DSA uses to trigger a fast ageing process anyway.
      
      There is, however, one other very promising caller of
      br_fdb_delete_by_port, and that is the bridge driver's handler of the
      SWITCHDEV_FDB_FLUSH_TO_BRIDGE atomic notifier. Currently the s390/qeth
      HiperSockets card driver is the only user of this.
      
      I can't say I understand that driver's architecture or interaction with
      the bridge, but it appears to not be a switchdev driver in the traditional
      sense of the word. Nonetheless, the mechanism it provides is a useful
      way for DSA to express the fact that it performs fast ageing too, in a
      way that does not change the existing behavior for other drivers.
      
      Cc: Alexandra Winter <wintera@linux.ibm.com>
      Cc: Julian Wiedmann <jwi@linux.ibm.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9264e4ad
    • V
      net: dsa: don't fast age bridge ports with learning turned off · 4eab90d9
      Vladimir Oltean 提交于
      On topology changes, stations that were dynamically learned on ports
      that are no longer part of the active topology must be flushed - this is
      described by clause "17.11 Updating learned station location information"
      of IEEE 802.1D-2004.
      
      However, when address learning on the bridge port is turned off in the
      first place, there is nothing to flush, so skip a potentially expensive
      operation.
      
      We can finally do this now since DSA is aware of the learning state of
      its bridged ports.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eab90d9
    • V
      net: dsa: centralize fast ageing when address learning is turned off · 045c45d1
      Vladimir Oltean 提交于
      Currently DSA leaves it down to device drivers to fast age the FDB on a
      port when address learning is disabled on it. There are 2 reasons for
      doing that in the first place:
      
      - when address learning is disabled by user space, through
        IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
        space typically wants to achieve is to operate in a mode with no
        dynamic FDB entry on that port. But if the port is already up, some
        addresses might have been already learned on it, and it seems silly to
        wait for 5 minutes for them to expire until something useful can be
        done.
      
      - when a port leaves a bridge and becomes standalone, DSA turns off
        address learning on it. This also has the nice side effect of flushing
        the dynamically learned bridge FDB entries on it, which is a good idea
        because standalone ports should not have bridge FDB entries on them.
      
      We let drivers manage fast ageing under this condition because if DSA
      were to do it, it would need to track each port's learning state, and
      act upon the transition, which it currently doesn't.
      
      But there are 2 reasons why doing it is better after all:
      
      - drivers might get it wrong and not do it (see b53_port_set_learning)
      
      - we would like to flush the dynamic entries from the software bridge
        too, and letting drivers do that would be another pain point
      
      So track the port learning state and trigger a fast age process
      automatically within DSA.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      045c45d1
  4. 08 8月, 2021 1 次提交
    • V
      net: dsa: don't fast age standalone ports · 39f32101
      Vladimir Oltean 提交于
      DSA drives the procedure to flush dynamic FDB entries from a port based
      on the change of STP state: whenever we go from a state where address
      learning is enabled (LEARNING, FORWARDING) to a state where it isn't
      (LISTENING, BLOCKING, DISABLED), we need to flush the existing dynamic
      entries.
      
      However, there are cases when this is not needed. Internally, when a
      DSA switch interface is not under a bridge, DSA still keeps it in the
      "FORWARDING" STP state. And when that interface joins a bridge, the
      bridge will meticulously iterate that port through all STP states,
      starting with BLOCKING and ending with FORWARDING. Because there is a
      state transition from the standalone version of FORWARDING into the
      temporary BLOCKING bridge port state, DSA calls the fast age procedure.
      
      Since commit 5e38c158 ("net: dsa: configure better brport flags when
      ports leave the bridge"), DSA asks standalone ports to disable address
      learning. Therefore, there can be no dynamic FDB entries on a standalone
      port. Therefore, it does not make sense to flush dynamic FDB entries on
      one.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39f32101
  5. 06 8月, 2021 2 次提交
    • V
      net: dsa: don't disable multicast flooding to the CPU even without an IGMP querier · c73c5708
      Vladimir Oltean 提交于
      Commit 08cc83cc ("net: dsa: add support for BRIDGE_MROUTER
      attribute") added an option for users to turn off multicast flooding
      towards the CPU if they turn off the IGMP querier on a bridge which
      already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router).
      
      And commit a8b659e7 ("net: dsa: act as passthrough for bridge port flags")
      simply papered over that issue, because it moved the decision to flood
      the CPU with multicast (or not) from the DSA core down to individual drivers,
      instead of taking a more radical position then.
      
      The truth is that disabling multicast flooding to the CPU is simply
      something we are not prepared to do now, if at all. Some reasons:
      
      - ICMP6 neighbor solicitation messages are unregistered multicast
        packets as far as the bridge is concerned. So if we stop flooding
        multicast, the outside world cannot ping the bridge device's IPv6
        link-local address.
      
      - There might be foreign interfaces bridged with our DSA switch ports
        (sending a packet towards the host does not necessarily equal
        termination, but maybe software forwarding). So if there is no one
        interested in that multicast traffic in the local network stack, that
        doesn't mean nobody is.
      
      - PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as
        the bridge is concerned. This should reach the CPU port.
      
      - The switch driver might not do FDB partitioning. And since we don't
        even bother to do more fine-grained flood disabling (such as "disable
        flooding _from_port_N_ towards the CPU port" as opposed to "disable
        flooding _from_any_port_ towards the CPU port"), this breaks standalone
        ports, or even multiple bridges where one has an IGMP querier and one
        doesn't.
      
      Reverting the logic makes all of the above work.
      
      Fixes: a8b659e7 ("net: dsa: act as passthrough for bridge port flags")
      Fixes: 08cc83cc ("net: dsa: add support for BRIDGE_MROUTER attribute")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c73c5708
    • V
      net: dsa: stop syncing the bridge mcast_router attribute at join time · 7df4e744
      Vladimir Oltean 提交于
      Qingfang points out that when a bridge with the default settings is
      created and a port joins it:
      
      ip link add br0 type bridge
      ip link set swp0 master br0
      
      DSA calls br_multicast_router() on the bridge to see if the br0 device
      is a multicast router port, and if it is, it enables multicast flooding
      to the CPU port, otherwise it disables it.
      
      If we look through the multicast_router_show() sysfs or at the
      IFLA_BR_MCAST_ROUTER netlink attribute, we see that the default mrouter
      attribute for the bridge device is "1" (MDB_RTR_TYPE_TEMP_QUERY).
      
      However, br_multicast_router() will return "0" (MDB_RTR_TYPE_DISABLED),
      because an mrouter port in the MDB_RTR_TYPE_TEMP_QUERY state may not be
      actually _active_ until it receives an actual IGMP query. So, the
      br_multicast_router() function should really have been called
      br_multicast_router_active() perhaps.
      
      When/if an IGMP query is received, the bridge device will transition via
      br_multicast_mark_router() into the active state until the
      ip4_mc_router_timer expires after an multicast_querier_interval.
      
      Of course, this does not happen if the bridge is created with an
      mcast_router attribute of "2" (MDB_RTR_TYPE_PERM).
      
      The point is that in lack of any IGMP query messages, and in the default
      bridge configuration, unregistered multicast packets will not be able to
      reach the CPU port through flooding, and this breaks many use cases
      (most obviously, IPv6 ND, with its ICMP6 neighbor solicitation multicast
      messages).
      
      Leave the multicast flooding setting towards the CPU port down to a driver
      level decision.
      
      Fixes: 010e269f ("net: dsa: sync up switchdev objects and port attributes when joining the bridge")
      Reported-by: NDENG Qingfang <dqfext@gmail.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7df4e744
  6. 05 8月, 2021 3 次提交
    • V
      net: dsa: tag_sja1105: optionally build as module when switch driver is module if PTP is enabled · f8b17a0b
      Vladimir Oltean 提交于
      TX timestamps are sent by SJA1110 as Ethernet packets containing
      metadata, so they are received by the tagging driver but must be
      processed by the switch driver - the one that is stateful since it
      keeps the TX timestamp queue.
      
      This means that there is an sja1110_process_meta_tstamp() symbol
      exported by the switch driver which is called by the tagging driver.
      
      There is a shim definition for that function when the switch driver is
      not compiled, which does nothing, but that shim is not effective when
      the tagging protocol driver is built-in and the switch driver is a
      module, because built-in code cannot call symbols exported by modules.
      
      So add an optional dependency between the tagger and the switch driver,
      if PTP support is enabled in the switch driver. If PTP is not enabled,
      sja1110_process_meta_tstamp() will translate into the shim "do nothing
      with these meta frames" function.
      
      Fixes: 566b18c8 ("net: dsa: sja1105: implement TX timestamping for SJA1110")
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8b17a0b
    • V
      net: dsa: give preference to local CPU ports · 2c0b0325
      Vladimir Oltean 提交于
      Be there an "H" switch topology, where there are 2 switches connected as
      follows:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      basically one where each switch has its own CPU port for termination,
      but there is also a DSA link in case packets need to be forwarded in
      hardware between one switch and another.
      
      DSA insists to see this as a daisy chain topology, basically registering
      all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
      eth1 as a valid DSA master.
      
      This is only half the story, since when asked using dsa_port_is_cpu(),
      DSA will respond that sw1p1 is a CPU port, however one which has no
      dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
      
      Furthermore, be there a driver for switches which support only one
      upstream port. This driver iterates through its ports and checks using
      dsa_is_upstream_port() whether the current port is an upstream one.
      For switch 1, two ports pass the "is upstream port" checks:
      
      - sw1p4 is an upstream port because it is a routing port towards the
        dedicated CPU port assigned using dsa_tree_setup_default_cpu()
      
      - sw1p1 is also an upstream port because it is a CPU port, albeit one
        that is disabled. This is because dsa_upstream_port() returns:
      
      	if (!cpu_dp)
      		return port;
      
        which means that if @dp does not have a ->cpu_dp pointer (which is a
        characteristic of CPU ports themselves as well as unused ports), then
        @dp is its own upstream port.
      
      So the driver for switch 1 rightfully says: I have two upstream ports,
      but I don't support multiple upstream ports! So let me error out, I
      don't know which one to choose and what to do with the other one.
      
      Generally I am against enforcing any default policy in the kernel in
      terms of user to CPU port assignment (like round robin or such) but this
      case is different. To solve the conundrum, one would have to:
      
      - Disable sw1p1 in the device tree or mark it as "not a CPU port" in
        order to comply with DSA's view of this topology as a daisy chain,
        where the termination traffic from switch 1 must pass through switch 0.
        This is counter-productive because it wastes 1Gbps of termination
        throughput in switch 1.
      - Disable the DSA link between sw0p4 and sw1p4 and do software
        forwarding between switch 0 and 1, and basically treat the switches as
        part of disjoint switch trees. This is counter-productive because it
        wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
      - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
        work, but it makes cross-chip bridging impossible. In this setup we
        would need to have 2 separate bridges, br0 spanning the ports of
        switch 0, and br1 spanning the ports of switch 1, and the "DSA links
        treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
        the gateway ports between one bridge and another. This is hard to
        manage from a user's perspective, who wants to have a unified view of
        the switching fabric and the ability to transparently add ports to the
        same bridge. VLANs would also need to be explicitly managed by the
        user on these gateway ports.
      
      So it seems that the only reasonable thing to do is to make DSA prefer
      CPU ports that are local to the switch. Meaning that by default, the
      user and DSA ports of switch 0 will get assigned to the CPU port from
      switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
      assigned to the CPU port from switch 1.
      
      The way this solves the problem is that sw1p4 is no longer an upstream
      port as far as switch 1 is concerned (it no longer views sw0p1 as its
      dedicated CPU port).
      
      So here we are, the first multi-CPU port that DSA supports is also
      perhaps the most uneventful one: the individual switches don't support
      multiple CPUs, however the DSA switch tree as a whole does have multiple
      CPU ports. No user space assignment of user ports to CPU ports is
      desirable, necessary, or possible.
      
      Ports that do not have a local CPU port (say there was an extra switch
      hanging off of sw0p0) default to the standard implementation of getting
      assigned to the first CPU port of the DSA switch tree. Is that good
      enough? Probably not (if the downstream switch was hanging off of switch
      1, we would most certainly prefer its CPU port to be sw1p1), but in
      order to support that use case too, we would need to traverse the
      dst->rtable in search of an optimum dedicated CPU port, one that has the
      smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
      moment, the DSA routing table structure does not keep the number of hops
      between dl->dp and dl->link_dp, and while it is probably deducible,
      there is zero justification to write that code now. Let's hope DSA will
      never have to support that use case.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c0b0325
    • V
      net: dsa: rename teardown_default_cpu to teardown_cpu_ports · 0e8eb9a1
      Vladimir Oltean 提交于
      There is nothing specific to having a default CPU port to what
      dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports,
      it would do the same thing: iterate through the ports of this switch
      tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e8eb9a1
  7. 04 8月, 2021 1 次提交
    • V
      net: dsa: tag_sja1105: consistently fail with arbitrary input · 421297ef
      Vladimir Oltean 提交于
      Dan Carpenter's smatch tests report that the "vid" variable, populated
      by sja1105_vlan_rcv when an skb is received by the tagger that has a
      VLAN ID which cannot be decoded by tag_8021q, may be uninitialized when
      used here:
      
      	if (source_port == -1 || switch_id == -1)
      		skb->dev = dsa_find_designated_bridge_port_by_vid(netdev, vid);
      
      The sja1105 driver, by construction, sets up the switch in a way that
      all data plane packets sent towards the CPU port are VLAN-tagged. So it
      is practically impossible, in a functional system, for a packet to be
      processed by sja1110_rcv() which is not a control packet and does not
      have a VLAN header either.
      
      However, it would be nice if the sja1105 tagging driver could
      consistently do something valid, for example fail, even if presented with
      packets that do not hold valid sja1105 tags. Currently it is a bit hard
      to argue that it does that, given the fact that a data plane packet with
      no VLAN tag will trigger a call to dsa_find_designated_bridge_port_by_vid
      with a vid argument that is an uninitialized stack variable.
      
      To fix this, we can initialize the u16 vid variable with 0, a value that
      can never be a bridge VLAN, so dsa_find_designated_bridge_port_by_vid
      will always return a NULL skb->dev.
      Reported-by: Nkernel test robot <lkp@intel.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20210802195137.303625-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      421297ef
  8. 02 8月, 2021 1 次提交
  9. 30 7月, 2021 1 次提交
    • V
      net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge · bea79078
      Vladimir Oltean 提交于
      DSA has gained the recent ability to deal gracefully with upper
      interfaces it cannot offload, such as the bridge, bonding or team
      drivers. When such uppers exist, the ports are still in standalone mode
      as far as the hardware is concerned.
      
      But when we deliver packets to the software bridge in order for that to
      do the forwarding, there is an unpleasant surprise in that the bridge
      will refuse to forward them. This is because we unconditionally set
      skb->offload_fwd_mark = true, meaning that the bridge thinks the frames
      were already forwarded in hardware by us.
      
      Since dp->bridge_dev is populated only when there is hardware offload
      for it, but not in the software fallback case, let's introduce a new
      helper that can be called from the tagger data path which sets the
      skb->offload_fwd_mark accordingly to zero when there is no hardware
      offload for bridging. This lets the bridge forward packets back to other
      interfaces of our switch, if needed.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bea79078
  10. 29 7月, 2021 1 次提交
    • V
      net: dsa: tag_sja1105: fix control packets on SJA1110 being received on an imprecise port · 04a17583
      Vladimir Oltean 提交于
      On RX, a control packet with SJA1110 will have:
      - an in-band control extension (DSA tag) composed of a header and an
        optional trailer (if it is a timestamp frame). We can (and do) deduce
        the source port and switch id from this.
      - a VLAN header, which can either be the tag_8021q RX VLAN (pvid) or the
        bridge VLAN. The sja1105_vlan_rcv() function attempts to deduce the
        source port and switch id a second time from this.
      
      The basic idea is that even though we don't need the source port
      information from the tag_8021q header if it's a control packet, we do
      need to strip that header before we pass it on to the network stack.
      
      The problem is that we call sja1105_vlan_rcv for ports under VLAN-aware
      bridges, and that function tells us it couldn't identify a tag_8021q
      header, so we need to perform imprecise RX by VID. Well, we don't,
      because we already know the source port and switch ID.
      
      This patch drops the return value from sja1105_vlan_rcv and we just look
      at the source_port and switch_id values from sja1105_rcv and sja1110_rcv
      which were initialized to -1. If they are still -1 it means we need to
      perform imprecise RX.
      
      Fixes: 884be12f ("net: dsa: sja1105: add support for imprecise RX")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04a17583
  11. 28 7月, 2021 1 次提交
    • A
      dev_ioctl: split out ndo_eth_ioctl · a7605370
      Arnd Bergmann 提交于
      Most users of ndo_do_ioctl are ethernet drivers that implement
      the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
      timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
      
      Separate these from the few drivers that use ndo_do_ioctl to
      implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
      
      This is a purely cosmetic change intended to help readers find
      their way through the implementation.
      
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Vivien Didelot <vivien.didelot@gmail.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Vladimir Oltean <olteanv@gmail.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: linux-rdma@vger.kernel.org
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7605370
  12. 27 7月, 2021 3 次提交
    • V
      Revert "net: dsa: Allow drivers to filter packets they can decode source port from" · edac6f63
      Vladimir Oltean 提交于
      This reverts commit cc1939e4.
      
      Currently 2 classes of DSA drivers are able to send/receive packets
      directly through the DSA master:
      - drivers with DSA_TAG_PROTO_NONE
      - sja1105
      
      Now that sja1105 has gained the ability to perform traffic termination
      even under the tricky case (VLAN-aware bridge), and that is much more
      functional (we can perform VLAN-aware bridging with foreign interfaces),
      there is no reason to keep this code in the receive path of the network
      core. So delete it.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edac6f63
    • V
      net: dsa: sja1105: add bridge TX data plane offload based on tag_8021q · b6ad86e6
      Vladimir Oltean 提交于
      The main desire for having this feature in sja1105 is to support network
      stack termination for traffic coming from a VLAN-aware bridge.
      
      For sja1105, offloading the bridge data plane means sending packets
      as-is, with the proper VLAN tag, to the chip. The chip will look up its
      FDB and forward them to the correct destination port.
      
      But we support bridge data plane offload even for VLAN-unaware bridges,
      and the implementation there is different. In fact, VLAN-unaware
      bridging is governed by tag_8021q, so it makes sense to have the
      .bridge_fwd_offload_add() implementation fully within tag_8021q.
      The key difference is that we only support 1 VLAN-aware bridge, but we
      support multiple VLAN-unaware bridges. So we need to make sure that the
      forwarding domain is not crossed by packets injected from the stack.
      
      For this, we introduce the concept of a tag_8021q TX VLAN for bridge
      forwarding offload. As opposed to the regular TX VLANs which contain
      only 2 ports (the user port and the CPU port), a bridge data plane TX
      VLAN is "multicast" (or "imprecise"): it contains all the ports that are
      part of a certain bridge, and the hardware will select where the packet
      goes within this "imprecise" forwarding domain.
      
      Each VLAN-unaware bridge has its own "imprecise" TX VLAN, so we make use
      of the unique "bridge_num" provided by DSA for the data plane offload.
      We use the same 3 bits from the tag_8021q VLAN ID format to encode this
      bridge number.
      
      Note that these 3 bit positions have been used before for sub-VLANs in
      best-effort VLAN filtering mode. The difference is that for best-effort,
      the sub-VLANs were only valid on RX (and it was documented that the
      sub-VLAN field needed to be transmitted as zero). Whereas for the bridge
      data plane offload, these 3 bits are only valid on TX.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6ad86e6
    • V
      net: dsa: sja1105: add support for imprecise RX · 884be12f
      Vladimir Oltean 提交于
      This is already common knowledge by now, but the sja1105 does not have
      hardware support for DSA tagging for data plane packets, and tag_8021q
      sets up a unique pvid per port, transmitted as VLAN-tagged towards the
      CPU, for the source port to be decoded nonetheless.
      
      When the port is part of a VLAN-aware bridge, the pvid committed to
      hardware is taken from the bridge and not from tag_8021q, so we need to
      work with that the best we can.
      
      Configure the switches to send all packets to the CPU as VLAN-tagged
      (even ones that were originally untagged on the wire) and make use of
      dsa_untag_bridge_pvid() to get rid of it before we send those packets up
      the network stack.
      
      With the classified VLAN used by hardware known to the tagger, we first
      peek at the VID in an attempt to figure out if the packet was received
      from a VLAN-unaware port (standalone or under a VLAN-unaware bridge),
      case in which we can continue to call dsa_8021q_rcv(). If that is not
      the case, the packet probably came from a VLAN-aware bridge. So we call
      the DSA helper that finds for us a "designated bridge port" - one that
      is a member of the VLAN ID from the packet, and is in the proper STP
      state - basically these are all checks performed by br_handle_frame() in
      the software RX data path.
      
      The bridge will accept the packet as valid even if the source port was
      maybe wrong. So it will maybe learn the MAC SA of the packet on the
      wrong port, and its software FDB will be out of sync with the hardware
      FDB. So replies towards this same MAC DA will not work, because the
      bridge will send towards a different netdev.
      
      This is where the bridge data plane offload ("imprecise TX") added by
      the next patch comes in handy. The software FDB is wrong, true, but the
      hardware FDB isn't, and by offloading the bridge forwarding plane we
      have a chance to right a wrong, and have the hardware look up the FDB
      for us for the reply packet. So it all cancels out.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      884be12f
  13. 23 7月, 2021 4 次提交
    • T
      net: dsa: tag_dsa: offload the bridge forwarding process · d82f8ab0
      Tobias Waldekranz 提交于
      Allow the DSA tagger to generate FORWARD frames for offloaded skbs
      sent from a bridge that we offload, allowing the switch to handle any
      frame replication that may be required. This also means that source
      address learning takes place on packets sent from the CPU, meaning
      that return traffic no longer needs to be flooded as unknown unicast.
      Signed-off-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d82f8ab0
    • V
      net: dsa: add support for bridge TX forwarding offload · 123abc06
      Vladimir Oltean 提交于
      For a DSA switch, to offload the forwarding process of a bridge device
      means to send the packets coming from the software bridge as data plane
      packets. This is contrary to everything that DSA has done so far,
      because the current taggers only know to send control packets (ones that
      target a specific destination port), whereas data plane packets are
      supposed to be forwarded according to the FDB lookup, much like packets
      ingressing on any regular ingress port. If the FDB lookup process
      returns multiple destination ports (flooding, multicast), then
      replication is also handled by the switch hardware - the bridge only
      sends a single packet and avoids the skb_clone().
      
      DSA keeps for each bridge port a zero-based index (the number of the
      bridge). Multiple ports performing TX forwarding offload to the same
      bridge have the same dp->bridge_num value, and ports not offloading the
      TX data plane of a bridge have dp->bridge_num = -1.
      
      The tagger can check if the packet that is being transmitted on has
      skb->offload_fwd_mark = true or not. If it does, it can be sure that the
      packet belongs to the data plane of a bridge, further information about
      which can be obtained based on dp->bridge_dev and dp->bridge_num.
      It can then compose a DSA tag for injecting a data plane packet into
      that bridge number.
      
      For the switch driver side, we offer two new dsa_switch_ops methods,
      called .port_bridge_fwd_offload_{add,del}, which are modeled after
      .port_bridge_{join,leave}.
      These methods are provided in case the driver needs to configure the
      hardware to treat packets coming from that bridge software interface as
      data plane packets. The switchdev <-> bridge interaction happens during
      the netdev_master_upper_dev_link() call, so to switch drivers, the
      effect is that the .port_bridge_fwd_offload_add() method is called
      immediately after .port_bridge_join().
      
      If the bridge number exceeds the number of bridges for which the switch
      driver can offload the TX data plane (and this includes the case where
      the driver can offload none), DSA falls back to simply returning
      tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      123abc06
    • V
      net: dsa: track the number of switches in a tree · 5b22d366
      Vladimir Oltean 提交于
      In preparation of supporting data plane forwarding on behalf of a
      software bridge, some drivers might need to view bridges as virtual
      switches behind the CPU port in a cross-chip topology.
      
      Give them some help and let them know how many physical switches there
      are in the tree, so that they can count the virtual switches starting
      from that number on.
      
      Note that the first dsa_switch_ops method where this information is
      reliably available is .setup(). This is because of how DSA works:
      in a tree with 3 switches, each calling dsa_register_switch(), the first
      2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
      and exit with error code 0 because the topology is not complete. Since
      probing is parallel at this point, one switch does not know about the
      existence of the other. Then the third switch comes, and for it,
      dsa_tree_setup_routing_table() returns complete = true. This switch goes
      ahead and calls dsa_tree_setup_switches() for everybody else, calling
      their .setup() methods too. This acts as the synchronization point.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b22d366
    • T
      net: bridge: switchdev: allow the TX data plane forwarding to be offloaded · 47211192
      Tobias Waldekranz 提交于
      Allow switchdevs to forward frames from the CPU in accordance with the
      bridge configuration in the same way as is done between bridge
      ports. This means that the bridge will only send a single skb towards
      one of the ports under the switchdev's control, and expects the driver
      to deliver the packet to all eligible ports in its domain.
      
      Primarily this improves the performance of multicast flows with
      multiple subscribers, as it allows the hardware to perform the frame
      replication.
      
      The basic flow between the driver and the bridge is as follows:
      
      - When joining a bridge port, the switchdev driver calls
        switchdev_bridge_port_offload() with tx_fwd_offload = true.
      
      - The bridge sends offloadable skbs to one of the ports under the
        switchdev's control using skb->offload_fwd_mark = true.
      
      - The switchdev driver checks the skb->offload_fwd_mark field and lets
        its FDB lookup select the destination port mask for this packet.
      
      v1->v2:
      - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
      - introduce a static key "br_switchdev_fwd_offload_used" to minimize the
        impact of the newly introduced feature on all the setups which don't
        have hardware that can make use of it
      - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
        line access
      - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
        __br_forward()
      - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
        is being used
      - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP
      
      v2->v3:
      - replace the solution based on .ndo_dfwd_add_station with a solution
        based on switchdev_bridge_port_offload
      - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD
      v3->v4: rebase
      v4->v5:
      - make sure the static key is decremented on bridge port unoffload
      - more function and variable renaming and comments for them:
        br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload
        br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload
        nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom
        nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload
        fwd_accel to tx_fwd_offload
      Signed-off-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47211192
  14. 22 7月, 2021 4 次提交
    • V
      net: bridge: move the switchdev object replay helpers to "push" mode · 4e51bf44
      Vladimir Oltean 提交于
      Starting with commit 4f2673b3 ("net: bridge: add helper to replay
      port and host-joined mdb entries"), DSA has introduced some bridge
      helpers that replay switchdev events (FDB/MDB/VLAN additions and
      deletions) that can be lost by the switchdev drivers in a variety of
      circumstances:
      
      - an IP multicast group was host-joined on the bridge itself before any
        switchdev port joined the bridge, leading to the host MDB entries
        missing in the hardware database.
      - during the bridge creation process, the MAC address of the bridge was
        added to the FDB as an entry pointing towards the bridge device
        itself, but with no switchdev ports being part of the bridge yet, this
        local FDB entry would remain unknown to the switchdev hardware
        database.
      - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
        before any switchdev port joined that LAG, leading to the hardware
        database missing those entries.
      - a switchdev port left a LAG that is a bridge port, while the LAG
        remained part of the bridge, and all FDB/MDB/VLAN entries remained
        installed in the hardware database of the switchdev port.
      
      Also, since commit 0d2cfbd4 ("net: bridge: ignore switchdev events
      for LAG ports which didn't request replay"), DSA introduced a method,
      based on a const void *ctx, to ensure that two switchdev ports under the
      same LAG that is a bridge port do not see the same MDB/VLAN entry being
      replayed twice by the bridge, once for every bridge port that joins the
      LAG.
      
      With so many ordering corner cases being possible, it seems unreasonable
      to expect a switchdev driver writer to get it right from the first try.
      Therefore, now that DSA has experimented with the bridge replay helpers
      for a little bit, we can move the code to the bridge driver where it is
      more readily available to all switchdev drivers.
      
      To convert the switchdev object replay helpers from "pull mode" (where
      the driver asks for them) to a "push mode" (where the bridge offers them
      automatically), the biggest problem is that the bridge needs to be aware
      when a switchdev port joins and leaves, even when the switchdev is only
      indirectly a bridge port (for example when the bridge port is a LAG
      upper of the switchdev).
      
      Luckily, we already have a hook for that, in the form of the newly
      introduced switchdev_bridge_port_offload() and
      switchdev_bridge_port_unoffload() calls. These offer a natural place for
      hooking the object addition and deletion replays.
      
      Extend the above 2 functions with:
      - pointers to the switchdev atomic notifier (for FDB replays) and the
        blocking notifier (for MDB and VLAN replays).
      - the "const void *ctx" argument required for drivers to be able to
        disambiguate between which port is targeted, when multiple ports are
        lowers of the same LAG that is a bridge port. Most of the drivers pass
        NULL to this argument, except the ones that support LAG offload and have
        the proper context check already in place in the switchdev blocking
        notifier handler.
      
      Also unexport the replay helpers, since nobody except the bridge calls
      them directly now.
      
      Note that:
      (a) we abuse the terminology slightly, because FDB entries are not
          "switchdev objects", but we count them as objects nonetheless.
          With no direct way to prove it, I think they are not modeled as
          switchdev objects because those can only be installed by the bridge
          to the hardware (as opposed to FDB entries which can be propagated
          in the other direction too). This is merely an abuse of terms, FDB
          entries are replayed too, despite not being objects.
      (b) the bridge does not attempt to sync port attributes to newly joined
          ports, just the countable stuff (the objects). The reason for this
          is simple: no universal and symmetric way to sync and unsync them is
          known. For example, VLAN filtering: what to do on unsync, disable or
          leave it enabled? Similarly, STP state, ageing timer, etc etc. What
          a switchdev port does when it becomes standalone again is not really
          up to the bridge's competence, and the driver should deal with it.
          On the other hand, replaying deletions of switchdev objects can be
          seen a matter of cleanup and therefore be treated by the bridge,
          hence this patch.
      
      We make the replay helpers opt-in for drivers, because they might not
      bring immediate benefits for them:
      
      - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
        so br_vlan_replay() should not do anything for the new drivers on
        which we call it. The existing drivers where there was even a slight
        possibility for there to exist a VLAN on a bridge port before they
        join it are already guarded against this: mlxsw and prestera deny
        joining LAG interfaces that are members of a bridge.
      
      - br_fdb_replay() should now notify of local FDB entries, but I patched
        all drivers except DSA to ignore these new entries in commit
        2c4eca3e ("net: bridge: switchdev: include local flag in FDB
        notifications"). Driver authors can lift this restriction as they
        wish, and when they do, they can also opt into the FDB replay
        functionality.
      
      - br_mdb_replay() should fix a real issue which is described in commit
        4f2673b3 ("net: bridge: add helper to replay port and host-joined
        mdb entries"). However most drivers do not offload the
        SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
        offload this switchdev object, and I don't completely understand the
        way in which they offload this switchdev object anyway. So I'll leave
        it up to these drivers' respective maintainers to opt into
        br_mdb_replay().
      
      So most of the drivers pass NULL notifier blocks for the replay helpers,
      except:
      - dpaa2-switch which was already acked/regression-tested with the
        helpers enabled (and there isn't much of a downside in having them)
      - ocelot which already had replay logic in "pull" mode
      - DSA which already had replay logic in "pull" mode
      
      An important observation is that the drivers which don't currently
      request bridge event replays don't even have the
      switchdev_bridge_port_{offload,unoffload} calls placed in proper places
      right now. This was done to avoid unnecessary rework for drivers which
      might never even add support for this. For driver writers who wish to
      add replay support, this can be used as a tentative placement guide:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e51bf44
    • V
      net: bridge: switchdev: let drivers inform which bridge ports are offloaded · 2f5dc00f
      Vladimir Oltean 提交于
      On reception of an skb, the bridge checks if it was marked as 'already
      forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
      is, it assigns the source hardware domain of that skb based on the
      hardware domain of the ingress port. Then during forwarding, it enforces
      that the egress port must have a different hardware domain than the
      ingress one (this is done in nbp_switchdev_allowed_egress).
      
      Non-switchdev drivers don't report any physical switch id (neither
      through devlink nor .ndo_get_port_parent_id), therefore the bridge
      assigns them a hardware domain of 0, and packets coming from them will
      always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
      
      Problems appear due to the fact that DSA would like to perform software
      fallback for bonding and team interfaces that the physical switch cannot
      offload.
      
             +-- br0 ---+
            / /   |      \
           / /    |       \
          /  |    |      bond0
         /   |    |     /    \
       swp0 swp1 swp2 swp3 swp4
      
      There, it is desirable that the presence of swp3 and swp4 under a
      non-offloaded LAG does not preclude us from doing hardware bridging
      beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high
      enough that software bridging between {swp0,swp1,swp2} and bond0 is not
      impractical.
      
      But this creates an impossible paradox given the current way in which
      port hardware domains are assigned. When the driver receives a packet
      from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
      something.
      
      - If we set it to 0, then the bridge will forward it towards swp1, swp2
        and bond0. But the switch has already forwarded it towards swp1 and
        swp2 (not to bond0, remember, that isn't offloaded, so as far as the
        switch is concerned, ports swp3 and swp4 are not looking up the FDB,
        and the entire bond0 is a destination that is strictly behind the
        CPU). But we don't want duplicated traffic towards swp1 and swp2, so
        it's not ok to set skb->offload_fwd_mark = 0.
      
      - If we set it to 1, then the bridge will not forward the skb towards
        the ports with the same switchdev mark, i.e. not to swp1, swp2 and
        bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
        have forwarded the skb there.
      
      So the real issue is that bond0 will be assigned the same hardware
      domain as {swp0,swp1,swp2}, because the function that assigns hardware
      domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
      lower interfaces until it finds something that implements devlink (calls
      dev_get_port_parent_id with bool recurse = true). This is a problem
      because the fact that bond0 can be offloaded by swp3 and swp4 in our
      example is merely an assumption.
      
      A solution is to give the bridge explicit hints as to what hardware
      domain it should use for each port.
      
      Currently, the bridging offload is very 'silent': a driver registers a
      netdevice notifier, which is put on the netns's notifier chain, and
      which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
      bridge, and the lower is an interface it knows about (one registered by
      this driver, normally). Then, from within that notifier, it does a bunch
      of stuff behind the bridge's back, without the bridge necessarily
      knowing that there's somebody offloading that port. It looks like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v
              call_netdevice_notifiers
                        |
                        v
             dsa_slave_netdevice_event
                        |
                        v
              oh, hey! it's for me!
                        |
                        v
                 .port_bridge_join
      
      What we do to solve the conundrum is to be less silent, and change the
      switchdev drivers to present themselves to the bridge. Something like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  hardware domain for
                        v                        |  this port, and zero
             dsa_slave_netdevice_event           |  if I got nothing.
                        |                        |
                        v                        |
              oh, hey! it's for me!              |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                        |                        |
                        +------------------------+
                   switchdev_bridge_port_offload(swp0, swp0)
      
      Then stacked interfaces (like bond0 on top of swp3/swp4) would be
      treated differently in DSA, depending on whether we can or cannot
      offload them.
      
      The offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  switchdev mark for
                        v                        |        bond0.
             dsa_slave_netdevice_event           | Coincidentally (or not),
                        |                        | bond0 and swp0, swp1, swp2
                        v                        | all have the same switchdev
              hmm, it's not quite for me,        | mark now, since the ASIC
               but my driver has already         | is able to forward towards
                 called .port_lag_join           | all these ports in hw.
                for it, because I have           |
            a port with dp->lag_dev == bond0.    |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                 for swp3 and swp4               |
                        |                        |
                        +------------------------+
                  switchdev_bridge_port_offload(bond0, swp3)
                  switchdev_bridge_port_offload(bond0, swp4)
      
      And the non-offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge waiting:
              call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                        |                        |  wasn't called, okay, I'll use a
                        v                        |  hwdom of zero for this one.
             dsa_slave_netdevice_event           :  Then packets received on swp0 will
                        |                        :  not be software-forwarded towards
                        v                        :  swp1, but they will towards bond0.
               it's not for me, but
             bond0 is an upper of swp3
            and swp4, but their dp->lag_dev
             is NULL because they couldn't
                  offload it.
      
      Basically we can draw the conclusion that the lowers of a bridge port
      can come and go, so depending on the configuration of lowers for a
      bridge port, it can dynamically toggle between offloaded and unoffloaded.
      Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
      
      This patch changes the way any switchdev driver interacts with the
      bridge. From now on, everybody needs to call switchdev_bridge_port_offload
      and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
      port as non-offloaded and allow software flooding to other ports from
      the same ASIC.
      
      Note that these functions lay the ground for a more complex handshake
      between switchdev drivers and the bridge in the future.
      
      For drivers that will request a replay of the switchdev objects when
      they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
      place the call to switchdev_bridge_port_unoffload() strategically inside
      the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
      NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
      need the netdev adjacency lists to be valid, and that is only true in
      NETDEV_PRECHANGEUPPER.
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f5dc00f
    • L
      net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum · 37120f23
      Lino Sanfilippo 提交于
      If the checksum calculation is offloaded to the network device (e.g due to
      NETIF_F_HW_CSUM inherited from the DSA master device), the calculated
      layer 4 checksum is incorrect. This is since the DSA tag which is placed
      after the layer 4 data is considered as being part of the daa and thus
      errorneously included into the checksum calculation.
      To avoid this, always calculate the layer 4 checksum in software.
      Signed-off-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37120f23
    • L
      net: dsa: ensure linearized SKBs in case of tail taggers · 21cf377a
      Lino Sanfilippo 提交于
      The function skb_put() that is used by tail taggers to make room for the
      DSA tag must only be called for linearized SKBS. However in case that the
      slave device inherited features like NETIF_F_HW_SG or NETIF_F_FRAGLIST the
      SKB passed to the slaves transmit function may not be linearized.
      Avoid those SKBs by clearing the NETIF_F_HW_SG and NETIF_F_FRAGLIST flags
      for tail taggers.
      Furthermore since the tagging protocol can be changed at runtime move the
      code for setting up the slaves features into dsa_slave_setup_tagger().
      Suggested-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21cf377a
  15. 20 7月, 2021 6 次提交
    • V
      net: dsa: use switchdev_handle_fdb_{add,del}_to_device · b94dc99c
      Vladimir Oltean 提交于
      Using the new fan-out helper for FDB entries installed on the software
      bridge, we can install host addresses with the proper refcount on the
      CPU port, such that this case:
      
      ip link set swp0 master br0
      ip link set swp1 master br0
      ip link set swp2 master br0
      ip link set swp3 master br0
      ip link set br0 address 00:01:02:03:04:05
      ip link set swp3 nomaster
      
      works properly and the br0 address remains installed as a host entry
      with refcount 3 instead of getting deleted.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b94dc99c
    • V
      net: switchdev: introduce helper for checking dynamically learned FDB entries · c6451cda
      Vladimir Oltean 提交于
      It is a bit difficult to understand what DSA checks when it tries to
      avoid installing dynamically learned addresses on foreign interfaces as
      local host addresses, so create a generic switchdev helper that can be
      reused and is generally more readable.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6451cda
    • V
      net: dsa: tag_8021q: add proper cross-chip notifier support · c64b9c05
      Vladimir Oltean 提交于
      The big problem which mandates cross-chip notifiers for tag_8021q is
      this:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      
      When the user runs:
      
      ip link add br0 type bridge
      ip link set sw0p0 master br0
      ip link set sw2p0 master br0
      
      It doesn't work.
      
      This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
      "other_ds" are at most 1 hop away from each other, so it is sufficient
      to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
      versa and presto, the cross-chip link works. When there is another
      switch in the middle, such as in this case switch 1 with its DSA links
      sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
      
      Which is exactly why the problem is quadratic: when a port joins a
      bridge, for each port in the tree that's already in that same bridge we
      notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
      tree. It is a very complicated web of VLANs.
      
      It must be mentioned that currently we install tag_8021q VLANs on too
      many ports (DSA links - to be precise, on all of them). For example,
      when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
      RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
      isn't any port of switch 0 that is a member of br0 (at least yet).
      In theory we could notify only the switches which sit in between the
      port joining the bridge and the port reacting to that bridge_join event.
      But in practice that is impossible, because of the way 'link' properties
      are described in the device tree. The DSA bindings require DT writers to
      list out not only the real/physical DSA links, but in fact the entire
      routing table, like for example switch 0 above will have:
      
      	sw0p3: port@3 {
      		link = <&sw1p4 &sw2p4>;
      	};
      
      This was done because:
      
      /* TODO: ideally DSA ports would have a single dp->link_dp member,
       * and no dst->rtable nor this struct dsa_link would be needed,
       * but this would require some more complex tree walking,
       * so keep it stupid at the moment and list them all.
       */
      
      but it is a perfect example of a situation where too much information is
      actively detrimential, because we are now in the position where we
      cannot distinguish a real DSA link from one that is put there to avoid
      the 'complex tree walking'. And because DT is ABI, there is not much we
      can change.
      
      And because we do not know which DSA links are real and which ones
      aren't, we can't really know if DSA switch A is in the data path between
      switches B and C, in the general case.
      
      So this is why tag_8021q RX VLANs are added on all DSA links, and
      probably why it will never change.
      
      On the other hand, at least the number of additions/deletions is well
      balanced, and this means that once we implement reference counting at
      the cross-chip notifier level a la fdb/mdb, there is absolutely zero
      need for a struct dsa_8021q_crosschip_link, it's all self-managing.
      
      In fact, with the tag_8021q notifiers emitted from the bridge join
      notifiers, it becomes so generic that sja1105 does not need to do
      anything anymore, we can just delete its implementation of the
      .crosschip_bridge_{join,leave} methods.
      
      Among other things we can simply delete is the home-grown implementation
      of sja1105_notify_crosschip_switches(). The reason why that is wrong is
      because it is not quadratic - it only covers remote switches to which we
      have a cross-chip bridging link and that does not cover in-between
      switches. This deletion is part of the same patch because sja1105 used
      to poke deep inside the guts of the tag_8021q context in order to do
      that. Because the cross-chip links went away, so needs the sja1105 code.
      
      Last but not least, dsa_8021q_setup_port() is simplified (and also
      renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
      on the CPU port too, the four dsa_8021q_vid_apply() calls:
      - 1 for RX VLAN on user port
      - 1 for the user port's RX VLAN on the CPU port
      - 1 for TX VLAN on user port
      - 1 for the user port's TX VLAN on the CPU port
      
      now get squashed into only 2 notifier calls via
      dsa_port_tag_8021q_vlan_add.
      
      And because the notifiers to add and to delete a tag_8021q VLAN are
      distinct, now we finally break up the port setup and teardown into
      separate functions instead of relying on a "bool enabled" flag which
      tells us what to do. Arguably it should have been this way from the
      get go.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c64b9c05
    • V
      net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time · e19cc13c
      Vladimir Oltean 提交于
      There has been at least one wasted opportunity for tag_8021q to be used
      by a driver:
      
      https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272
      
      because of a design decision: the declared purpose of tag_8021q is to
      offer source port/switch identification for a tagging driver for packets
      coming from a switch with no hardware DSA tagging support. It is not
      intended to provide VLAN-based port isolation, because its first user,
      sja1105, had another mechanism for bridging domain isolation, the L2
      Forwarding Table. So even if 2 ports are in the same VLAN but they are
      separated via the L2 Forwarding Table, they will not communicate with
      one another. The L2 Forwarding Table is managed by the
      sja1105_bridge_join() and sja1105_bridge_leave() methods.
      
      As a consequence, today tag_8021q does not bother too much with hooking
      into .port_bridge_join() and .port_bridge_leave() because that would
      introduce yet another degree of freedom, it just iterates statically
      through all ports of a switch and adds the RX VLAN of one port to all
      the others. In this way, whenever .port_bridge_join() is called,
      bridging will magically work because the RX VLANs are already installed
      everywhere they need to be.
      
      This is not to say that the reason for the change in this patch is to
      satisfy the hellcreek and similar use cases, that is merely a nice side
      effect. Instead it is to make sja1105 cross-chip links work properly
      over a DSA link.
      
      For context, sja1105 today supports a degenerate form of cross-chip
      bridging, where the switches are interconnected through their CPU ports
      ("disjoint trees" topology). There is some code which has been
      generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
      enough, and frankly it is impossible to build upon that.
      Real multi-switch DSA trees, like daisy chains or H trees, which have
      actual DSA links, do not work.
      
      The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
      for cross-chip bridging, which is a table by which the local switch can
      select the forwarding domain for packets from a certain ingress switch
      ID and source port. The sja1105 switches cannot parse their own DSA
      tags, because, well, they don't really have support for DSA tags, it's
      all VLANs.
      
      So to make something like cross-chip bridging between sw0p0 and sw1p0 to
      work over the sw0p3/sw1p3 DSA link to work with sja1105 in the topology
      below:
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ] ---- [  dsa  ] [  cpu  ] [  user ] [  user ]
      
      we need to ask ourselves 2 questions:
      
      (1) how should the L2 Forwarding Table be managed?
      (2) how should the VLAN Lookup Table be managed?
      
      i.e. what should prevent packets from going to unwanted ports?
      
      Since as mentioned, there is no PVT, the L2 Forwarding Table only
      contains forwarding rules for local ports. So we can say "all user ports
      are allowed to forward to all CPU ports and all DSA links".
      
      If we allow forwarding to DSA links unconditionally, this means we must
      prevent forwarding using the VLAN Lookup Table. This is in fact
      asymmetric with what we do for tag_8021q on ports local to the same
      switch, and it matters because now that we are making tag_8021q a core
      DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
      the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
      for local forwarding in the same way as cross-chip forwarding.
      
      Note that there is a very precise reason why tag_8021q hooks into
      dsa_switch_bridge_join() which acts at the cross-chip notifier level,
      and not at a higher level such as dsa_port_bridge_join(). We need to
      install the RX VLAN of the newly joining port into the VLAN table of all
      the existing ports across the tree that are part of the same bridge, and
      the notifier already does the iteration through the switches for us.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e19cc13c
    • V
      net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register · 328621f6
      Vladimir Oltean 提交于
      Right now, setting up tag_8021q is a 2-step operation for a driver,
      first the context structure needs to be created, then the VLANs need to
      be installed on the ports. A similar thing is true for teardown.
      
      Merge the 2 steps into the register/unregister methods, to be as
      transparent as possible for the driver as to what tag_8021q does behind
      the scenes. This also gets rid of the funny "bool setup == true means
      setup, == false means teardown" API that tag_8021q used to expose.
      
      Note that dsa_tag_8021q_register() must be called at least in the
      .setup() driver method and never earlier (like in the driver probe
      function). This is because the DSA switch tree is not initialized at
      probe time, and the cross-chip notifiers will not work.
      
      For symmetry with .setup(), the unregister method should be put in
      .teardown().
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      328621f6
    • V
      net: dsa: make tag_8021q operations part of the core · 5da11eb4
      Vladimir Oltean 提交于
      Make tag_8021q a more central element of DSA and move the 2 driver
      specific operations outside of struct dsa_8021q_context (which is
      supposed to hold dynamic data and not really constant function
      pointers).
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5da11eb4