1. 09 12月, 2021 2 次提交
  2. 25 10月, 2021 2 次提交
    • V
      net: dsa: introduce locking for the address lists on CPU and DSA ports · 338a3a47
      Vladimir Oltean 提交于
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      338a3a47
    • D
      Revert "Merge branch 'dsa-rtnl'" · 2d7e73f0
      David S. Miller 提交于
      This reverts commit 965e6b26, reversing
      changes made to 4d98bb0d.
      2d7e73f0
  3. 24 10月, 2021 1 次提交
    • V
      net: dsa: introduce locking for the address lists on CPU and DSA ports · d3bd8924
      Vladimir Oltean 提交于
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3bd8924
  4. 21 10月, 2021 2 次提交
  5. 20 10月, 2021 1 次提交
  6. 14 10月, 2021 1 次提交
  7. 09 10月, 2021 2 次提交
    • V
      net: dsa: hold rtnl_lock in dsa_switch_setup_tag_protocol · 1951b3f1
      Vladimir Oltean 提交于
      It was a documented fact that ds->ops->change_tag_protocol() offered
      rtnetlink mutex protection to the switch driver, since there was an
      ASSERT_RTNL right before the call in dsa_switch_change_tag_proto()
      (initiated from sysfs).
      
      The blamed commit introduced another call path for
      ds->ops->change_tag_protocol() which does not hold the rtnl_mutex.
      This is:
      
      dsa_tree_setup
      -> dsa_tree_setup_switches
         -> dsa_switch_setup
            -> dsa_switch_setup_tag_protocol
               -> ds->ops->change_tag_protocol()
         -> dsa_port_setup
            -> dsa_slave_create
               -> register_netdevice(slave_dev)
      -> dsa_tree_setup_master
         -> dsa_master_setup
            -> dev->dsa_ptr = cpu_dp
      
      The reason why the rtnl_mutex is held in the sysfs call path is to
      ensure that, once the master and all the DSA interfaces are down (which
      is required so that no packets flow), they remain down during the
      tagging protocol change.
      
      The above calling order illustrates the fact that it should not be risky
      to change the initial tagging protocol to the one specified in the
      device tree at the given time:
      
      - packets cannot enter the dsa_switch_rcv() packet type handler since
        netdev_uses_dsa() for the master will not yet return true, since
        dev->dsa_ptr has not yet been populated
      
      - packets cannot enter the dsa_slave_xmit() function because no DSA
        interface has yet been registered
      
      So from the DSA core's perspective, holding the rtnl_mutex is indeed not
      necessary.
      
      Yet, drivers may need to do things which need rtnl_mutex protection. For
      example:
      
      felix_set_tag_protocol
      -> felix_setup_tag_8021q
         -> dsa_tag_8021q_register
            -> dsa_tag_8021q_setup
               -> dsa_tag_8021q_port_setup
                  -> vlan_vid_add
                     -> ASSERT_RTNL
      
      These drivers do not really have a choice to take the rtnl_mutex
      themselves, since in the sysfs case, the rtnl_mutex is already held.
      
      Fixes: deff7107 ("net: dsa: Allow default tag protocol to be overridden from DT")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1951b3f1
    • V
      net: dsa: fix bridge_num not getting cleared after ports leaving the bridge · 1bec0f05
      Vladimir Oltean 提交于
      The dp->bridge_num is zero-based, with -1 being the encoding for an
      invalid value. But dsa_bridge_num_put used to check for an invalid value
      by comparing bridge_num with 0, which is of course incorrect.
      
      The result is that the bridge_num will never get cleared by
      dsa_bridge_num_put, and further port joins to other bridges will get a
      bridge_num larger than the previous one, and once all the available
      bridges with TX forwarding offload supported by the hardware get
      exhausted, the TX forwarding offload feature is simply disabled.
      
      In the case of sja1105, 7 iterations of the loop below are enough to
      exhaust the TX forwarding offload bits, and further bridge joins operate
      without that feature.
      
      ip link add br0 type bridge vlan_filtering 1
      
      while :; do
              ip link set sw0p2 master br0 && sleep 1
              ip link set sw0p2 nomaster && sleep 1
      done
      
      This issue is enough of an indication that having the dp->bridge_num
      invalid encoding be a negative number is prone to bugs, so this will be
      changed to a one-based value, with the dp->bridge_num of zero being the
      indication of no bridge. However, that is material for net-next.
      
      Fixes: f5e165e7 ("net: dsa: track unique bridge numbers across all DSA switch trees")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      1bec0f05
  8. 27 9月, 2021 1 次提交
  9. 22 9月, 2021 1 次提交
  10. 21 9月, 2021 2 次提交
    • V
      net: dsa: don't allocate the slave_mii_bus using devres · 5135e96a
      Vladimir Oltean 提交于
      The Linux device model permits both the ->shutdown and ->remove driver
      methods to get called during a shutdown procedure. Example: a DSA switch
      which sits on an SPI bus, and the SPI bus driver calls this on its
      ->shutdown method:
      
      spi_unregister_controller
      -> device_for_each_child(&ctlr->dev, NULL, __unregister);
         -> spi_unregister_device(to_spi_device(dev));
            -> device_del(&spi->dev);
      
      So this is a simple pattern which can theoretically appear on any bus,
      although the only other buses on which I've been able to find it are
      I2C:
      
      i2c_del_adapter
      -> device_for_each_child(&adap->dev, NULL, __unregister_client);
         -> i2c_unregister_device(client);
            -> device_unregister(&client->dev);
      
      The implication of this pattern is that devices on these buses can be
      unregistered after having been shut down. The drivers for these devices
      might choose to return early either from ->remove or ->shutdown if the
      other callback has already run once, and they might choose that the
      ->shutdown method should only perform a subset of the teardown done by
      ->remove (to avoid unnecessary delays when rebooting).
      
      So in other words, the device driver may choose on ->remove to not
      do anything (therefore to not unregister an MDIO bus it has registered
      on ->probe), because this ->remove is actually triggered by the
      device_shutdown path, and its ->shutdown method has already run and done
      the minimally required cleanup.
      
      This used to be fine until the blamed commit, but now, the following
      BUG_ON triggers:
      
      void mdiobus_free(struct mii_bus *bus)
      {
      	/* For compatibility with error handling in drivers. */
      	if (bus->state == MDIOBUS_ALLOCATED) {
      		kfree(bus);
      		return;
      	}
      
      	BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
      	bus->state = MDIOBUS_RELEASED;
      
      	put_device(&bus->dev);
      }
      
      In other words, there is an attempt to free an MDIO bus which was not
      unregistered. The attempt to free it comes from the devres release
      callbacks of the SPI device, which are executed after the device is
      unregistered.
      
      I'm not saying that the fact that MDIO buses allocated using devres
      would automatically get unregistered wasn't strange. I'm just saying
      that the commit didn't care about auditing existing call paths in the
      kernel, and now, the following code sequences are potentially buggy:
      
      (a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
          located on a bus that unregisters its children on shutdown. After
          the blamed patch, either both the alloc and the register should use
          devres, or none should.
      
      (b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
          mdiobus_unregister at all in the remove path. After the blamed
          patch, nobody unregisters the MDIO bus anymore, so this is even more
          buggy than the previous case which needs a specific bus
          configuration to be seen, this one is an unconditional bug.
      
      In this case, DSA falls into category (a), it tries to be helpful and
      registers an MDIO bus on behalf of the switch, which might be on such a
      bus. I've no idea why it does it under devres.
      
      It does this on probe:
      
      	if (!ds->slave_mii_bus && ds->ops->phy_read)
      		alloc and register mdio bus
      
      and this on remove:
      
      	if (ds->slave_mii_bus && ds->ops->phy_read)
      		unregister mdio bus
      
      I _could_ imagine using devres because the condition used on remove is
      different than the condition used on probe. So strictly speaking, DSA
      cannot determine whether the ds->slave_mii_bus it sees on remove is the
      ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
      have solved that problem. But nonetheless, the existing code already
      proceeds to unregister the MDIO bus, even though it might be
      unregistering an MDIO bus it has never registered. So I can only guess
      that no driver that implements ds->ops->phy_read also allocates and
      registers ds->slave_mii_bus itself.
      
      So in that case, if unregistering is fine, freeing must be fine too.
      
      Stop using devres and free the MDIO bus manually. This will make devres
      stop attempting to free a still registered MDIO bus on ->shutdown.
      
      Fixes: ac3a68d5 ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Reported-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Tested-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5135e96a
    • V
      net: dsa: fix dsa_tree_setup error path · e5845aa0
      Vladimir Oltean 提交于
      Since the blamed commit, dsa_tree_teardown_switches() was split into two
      smaller functions, dsa_tree_teardown_switches and dsa_tree_teardown_ports.
      
      However, the error path of dsa_tree_setup stopped calling dsa_tree_teardown_ports.
      
      Fixes: a57d8c21 ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5845aa0
  11. 19 9月, 2021 2 次提交
    • V
      net: dsa: tear down devlink port regions when tearing down the devlink port on error · fd292c18
      Vladimir Oltean 提交于
      Commit 86f8b1c0 ("net: dsa: Do not make user port errors fatal")
      decided it was fine to ignore errors on certain ports that fail to
      probe, and go on with the ports that do probe fine.
      
      Commit fb6ec87f ("net: dsa: Fix type was not set for devlink port")
      noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
      called, and devlink notices after a timeout of 3600 seconds and prints a
      WARN_ON. So it went ahead to unregister the devlink port. And because
      there exists an UNUSED port flavour, we actually re-register the devlink
      port as UNUSED.
      
      Commit 08156ba4 ("net: dsa: Add devlink port regions support to
      DSA") added devlink port regions, which are set up by the driver and not
      by DSA.
      
      When we trigger the devlink port deregistration and reregistration as
      unused, devlink now prints another WARN_ON, from here:
      
      devlink_port_unregister:
      	WARN_ON(!list_empty(&devlink_port->region_list));
      
      So the port still has regions, which makes sense, because they were set
      up by the driver, and the driver doesn't know we're unregistering the
      devlink port.
      
      Somebody needs to tear them down, and optionally (actually it would be
      nice, to be consistent) set them up again for the new devlink port.
      
      But DSA's layering stays in our way quite badly here.
      
      The options I've considered are:
      
      1. Introduce a function in devlink to just change a port's type and
         flavour. No dice, devlink keeps a lot of state, it really wants the
         port to not be registered when you set its parameters, so changing
         anything can only be done by destroying what we currently have and
         recreating it.
      
      2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
         and the region returned, keep those in a list, then when the devlink
         port unregister needs to take place, the existing devlink regions are
         destroyed by DSA, and we replay the creation of new regions using the
         cached parameters. Problem: mv88e6xxx keeps the region pointers in
         chip->ports[port].region, and these will remain stale after DSA frees
         them. There are many things DSA can do, but updating mv88e6xxx's
         private pointers is not one of them.
      
      3. Just let the driver do it (i.e. introduce a very specific method
         called ds->ops->port_reinit_as_unused, which unregisters its devlink
         port devlink regions, then the old devlink port, then registers the
         new one, then the devlink port regions for it). While it does work,
         as opposed to the others, it's pretty horrible from an API
         perspective and we can do better.
      
      4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
         which in the case of mv88e6xxx must register and unregister the
         devlink port regions. Call these 2 methods when the port must be
         reinitialized as unused.
      
      Naturally, I went for the 4th approach.
      
      Fixes: 08156ba4 ("net: dsa: Add devlink port regions support to DSA")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd292c18
    • V
      net: dsa: be compatible with masters which unregister on shutdown · 0650bf52
      Vladimir Oltean 提交于
      Lino reports that on his system with bcmgenet as DSA master and KSZ9897
      as a switch, rebooting or shutting down never works properly.
      
      What does the bcmgenet driver have special to trigger this, that other
      DSA masters do not? It has an implementation of ->shutdown which simply
      calls its ->remove implementation. Otherwise said, it unregisters its
      network interface on shutdown.
      
      This message can be seen in a loop, and it hangs the reboot process there:
      
      unregister_netdevice: waiting for eth0 to become free. Usage count = 3
      
      So why 3?
      
      A usage count of 1 is normal for a registered network interface, and any
      virtual interface which links itself as an upper of that will increment
      it via dev_hold. In the case of DSA, this is the call path:
      
      dsa_slave_create
      -> netdev_upper_dev_link
         -> __netdev_upper_dev_link
            -> __netdev_adjacent_dev_insert
               -> dev_hold
      
      So a DSA switch with 3 interfaces will result in a usage count elevated
      by two, and netdev_wait_allrefs will wait until they have gone away.
      
      Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
      delete themselves, but DSA cannot just vanish and go poof, at most it
      can unbind itself from the switch devices, but that must happen strictly
      earlier compared to when the DSA master unregisters its net_device, so
      reacting on the NETDEV_UNREGISTER event is way too late.
      
      It seems that it is a pretty established pattern to have a driver's
      ->shutdown hook redirect to its ->remove hook, so the same code is
      executed regardless of whether the driver is unbound from the device, or
      the system is just shutting down. As Florian puts it, it is quite a big
      hammer for bcmgenet to unregister its net_device during shutdown, but
      having a common code path with the driver unbind helps ensure it is well
      tested.
      
      So DSA, for better or for worse, has to live with that and engage in an
      arms race of implementing the ->shutdown hook too, from all individual
      drivers, and do something sane when paired with masters that unregister
      their net_device there. The only sane thing to do, of course, is to
      unlink from the master.
      
      However, complications arise really quickly.
      
      The pattern of redirecting ->shutdown to ->remove is not unique to
      bcmgenet or even to net_device drivers. In fact, SPI controllers do it
      too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
      and MDIO controllers do it too (this is something I have not researched
      too deeply, but even if this is not the case today, it is certainly
      plausible to happen in the future, and must be taken into consideration).
      
      Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
      insane implication is that for the exact same DSA switch device, we
      might have both ->shutdown and ->remove getting called.
      
      So we need to do something with that insane environment. The pattern
      I've come up with is "if this, then not that", so if either ->shutdown
      or ->remove gets called, we set the device's drvdata to NULL, and in the
      other hook, we check whether the drvdata is NULL and just do nothing.
      This is probably not necessary for platform devices, just for devices on
      buses, but I would really insist for consistency among drivers, because
      when code is copy-pasted, it is not always copy-pasted from the best
      sources.
      
      So depending on whether the DSA switch's ->remove or ->shutdown will get
      called first, we cannot really guarantee even for the same driver if
      rebooting will result in the same code path on all platforms. But
      nonetheless, we need to do something minimally reasonable on ->shutdown
      too to fix the bug. Of course, the ->remove will do more (a full
      teardown of the tree, with all data structures freed, and this is why
      the bug was not caught for so long). The new ->shutdown method is kept
      separate from dsa_unregister_switch not because we couldn't have
      unregistered the switch, but simply in the interest of doing something
      quick and to the point.
      
      The big question is: does the DSA switch's ->shutdown get called earlier
      than the DSA master's ->shutdown? If not, there is still a risk that we
      might still trigger the WARN_ON in unregister_netdevice that says we are
      attempting to unregister a net_device which has uppers. That's no good.
      Although the reference to the master net_device won't physically go away
      even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
      on it.
      
      The answer to that question lies in this comment above device_link_add:
      
       * A side effect of the link creation is re-ordering of dpm_list and the
       * devices_kset list by moving the consumer device and all devices depending
       * on it to the ends of these lists (that does not happen to devices that have
       * not been registered when this function is called).
      
      so the fact that DSA uses device_link_add towards its master is not
      exactly for nothing. device_shutdown() walks devices_kset from the back,
      so this is our guarantee that DSA's shutdown happens before the master's
      shutdown.
      
      Fixes: 2f1e8ea7 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
      Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/Reported-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0650bf52
  12. 16 9月, 2021 1 次提交
    • V
      net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports · a57d8c21
      Vladimir Oltean 提交于
      Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
      messages appear:
      
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
      
      (and similarly for other ports)
      
      What happens is that DSA has a policy "even if there are bugs, let's at
      least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
      dp->mdbs lists, which are supposed to be empty.
      
      But deleting that cleanup code, the warnings go away.
      
      => the FDB and MDB lists (used for refcounting on shared ports, aka CPU
      and DSA ports) will eventually be empty, but are not empty by the time
      we tear down those ports. Aka we are deleting them too soon.
      
      The addresses that DSA complains about are host-trapped addresses: the
      local addresses of the ports, and the MAC address of the bridge device.
      
      The problem is that offloading those entries happens from a deferred
      work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
      races with the teardown of the CPU and DSA ports where the refcounting
      is kept.
      
      In fact, not only it races, but fundamentally speaking, if we iterate
      through the port list linearly, we might end up tearing down the shared
      ports even before we delete a DSA user port which has a bridge upper.
      
      So as it turns out, we need to first tear down the user ports (and the
      unused ones, for no better place of doing that), then the shared ports
      (the CPU and DSA ports). In between, we need to ensure that all work
      items scheduled by our switchdev handlers (which only run for user
      ports, hence the reason why we tear them down first) have finished.
      
      Fixes: 161ca59d ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      a57d8c21
  13. 23 8月, 2021 1 次提交
    • V
      net: dsa: track unique bridge numbers across all DSA switch trees · f5e165e7
      Vladimir Oltean 提交于
      Right now, cross-tree bridging setups work somewhat by mistake.
      
      In the case of cross-tree bridging with sja1105, all switch instances
      need to agree upon a common VLAN ID for forwarding a packet that belongs
      to a certain bridging domain.
      
      With TX forwarding offload, the VLAN ID is the bridge VLAN for
      VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
      (a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
      
      The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
      value calculated by DSA independently for each switch tree.
      
      If ports from one tree join one bridge, and ports from another tree join
      another bridge, DSA will assign them the same bridge_num, even though
      the bridges are different. If cross-tree bridging is supported, this
      is an issue.
      
      Modify DSA to calculate the bridge_num globally across all switch trees.
      This has the implication for a driver that the dp->bridge_num value that
      DSA will assign to its ports might not be contiguous, if there are
      boards with multiple DSA drivers instantiated. Additionally, all
      bridge_num values eat up towards each switch's
      ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
      and can be seen as a limitation introduced by this patch. However, that
      is the lesser evil for now.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5e165e7
  14. 12 8月, 2021 1 次提交
    • V
      net: dsa: tag_8021q: don't broadcast during setup/teardown · 724395f4
      Vladimir Oltean 提交于
      Currently, on my board with multiple sja1105 switches in disjoint trees
      described in commit f66a6a69 ("net: dsa: permit cross-chip bridging
      between all trees in the system"), rebooting the board triggers the
      following benign warnings:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      
      Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
      RX VLANs cannot be found on switch 2's CPU port.
      
      But why would switch 2 even attempt to delete switch 1's TX and RX
      tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
      and it is supposed that it had added those VLANs in the first place
      (because in dsa_port_tag_8021q_vlan_match, all CPU ports match
      regardless of their tree index or switch index).
      
      The two trees probe asynchronously, and when switch 1 probed, it called
      dsa_broadcast which did not notify the tree of switch 2, because that
      didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
      _is_ notified of the deletion.
      
      Before jumping to introduce a synchronization mechanism between the
      probing across disjoint switch trees, let's take a step back and see
      whether we _need_ to do that in the first place.
      
      The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
      only if switch 1 and 2 were part of a cross-chip bridge. And
      dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
      was synchronous, the bridge_join would just end up bumping the VLANs'
      refcount, because they are already installed by the setup path).
      
      Since by the time the ports are bridged, all DSA trees are already set
      up, and we don't need the tag_8021q VLANs of one switch installed on the
      other switches during probe time, the answer is that we don't need to
      fix the synchronization issue.
      
      So make the setup and teardown code paths call dsa_port_notify, which
      notifies only the local tree, and the bridge code paths call
      dsa_broadcast, which let the other trees know as well.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      724395f4
  15. 09 8月, 2021 1 次提交
    • L
      devlink: Set device as early as possible · 919d13a7
      Leon Romanovsky 提交于
      All kernel devlink implementations call to devlink_alloc() during
      initialization routine for specific device which is used later as
      a parent device for devlink_register().
      
      Such late device assignment causes to the situation which requires us to
      call to device_register() before setting other parameters, but that call
      opens devlink to the world and makes accessible for the netlink users.
      
      Any attempt to move devlink_register() to be the last call generates the
      following error due to access to the devlink->dev pointer.
      
      [    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
      [    8.760305]  devlink_param_notify+0x6d/0x180
      [    8.760435]  __devlink_params_register+0x2f1/0x670
      [    8.760558]  devlink_params_register+0x1e/0x20
      
      The simple change of API to set devlink device in the devlink_alloc()
      instead of devlink_register() fixes all this above and ensures that
      prior to call to devlink_register() everything already set.
      Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      919d13a7
  16. 05 8月, 2021 2 次提交
    • V
      net: dsa: give preference to local CPU ports · 2c0b0325
      Vladimir Oltean 提交于
      Be there an "H" switch topology, where there are 2 switches connected as
      follows:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      basically one where each switch has its own CPU port for termination,
      but there is also a DSA link in case packets need to be forwarded in
      hardware between one switch and another.
      
      DSA insists to see this as a daisy chain topology, basically registering
      all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
      eth1 as a valid DSA master.
      
      This is only half the story, since when asked using dsa_port_is_cpu(),
      DSA will respond that sw1p1 is a CPU port, however one which has no
      dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
      
      Furthermore, be there a driver for switches which support only one
      upstream port. This driver iterates through its ports and checks using
      dsa_is_upstream_port() whether the current port is an upstream one.
      For switch 1, two ports pass the "is upstream port" checks:
      
      - sw1p4 is an upstream port because it is a routing port towards the
        dedicated CPU port assigned using dsa_tree_setup_default_cpu()
      
      - sw1p1 is also an upstream port because it is a CPU port, albeit one
        that is disabled. This is because dsa_upstream_port() returns:
      
      	if (!cpu_dp)
      		return port;
      
        which means that if @dp does not have a ->cpu_dp pointer (which is a
        characteristic of CPU ports themselves as well as unused ports), then
        @dp is its own upstream port.
      
      So the driver for switch 1 rightfully says: I have two upstream ports,
      but I don't support multiple upstream ports! So let me error out, I
      don't know which one to choose and what to do with the other one.
      
      Generally I am against enforcing any default policy in the kernel in
      terms of user to CPU port assignment (like round robin or such) but this
      case is different. To solve the conundrum, one would have to:
      
      - Disable sw1p1 in the device tree or mark it as "not a CPU port" in
        order to comply with DSA's view of this topology as a daisy chain,
        where the termination traffic from switch 1 must pass through switch 0.
        This is counter-productive because it wastes 1Gbps of termination
        throughput in switch 1.
      - Disable the DSA link between sw0p4 and sw1p4 and do software
        forwarding between switch 0 and 1, and basically treat the switches as
        part of disjoint switch trees. This is counter-productive because it
        wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
      - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
        work, but it makes cross-chip bridging impossible. In this setup we
        would need to have 2 separate bridges, br0 spanning the ports of
        switch 0, and br1 spanning the ports of switch 1, and the "DSA links
        treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
        the gateway ports between one bridge and another. This is hard to
        manage from a user's perspective, who wants to have a unified view of
        the switching fabric and the ability to transparently add ports to the
        same bridge. VLANs would also need to be explicitly managed by the
        user on these gateway ports.
      
      So it seems that the only reasonable thing to do is to make DSA prefer
      CPU ports that are local to the switch. Meaning that by default, the
      user and DSA ports of switch 0 will get assigned to the CPU port from
      switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
      assigned to the CPU port from switch 1.
      
      The way this solves the problem is that sw1p4 is no longer an upstream
      port as far as switch 1 is concerned (it no longer views sw0p1 as its
      dedicated CPU port).
      
      So here we are, the first multi-CPU port that DSA supports is also
      perhaps the most uneventful one: the individual switches don't support
      multiple CPUs, however the DSA switch tree as a whole does have multiple
      CPU ports. No user space assignment of user ports to CPU ports is
      desirable, necessary, or possible.
      
      Ports that do not have a local CPU port (say there was an extra switch
      hanging off of sw0p0) default to the standard implementation of getting
      assigned to the first CPU port of the DSA switch tree. Is that good
      enough? Probably not (if the downstream switch was hanging off of switch
      1, we would most certainly prefer its CPU port to be sw1p1), but in
      order to support that use case too, we would need to traverse the
      dst->rtable in search of an optimum dedicated CPU port, one that has the
      smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
      moment, the DSA routing table structure does not keep the number of hops
      between dl->dp and dl->link_dp, and while it is probably deducible,
      there is zero justification to write that code now. Let's hope DSA will
      never have to support that use case.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c0b0325
    • V
      net: dsa: rename teardown_default_cpu to teardown_cpu_ports · 0e8eb9a1
      Vladimir Oltean 提交于
      There is nothing specific to having a default CPU port to what
      dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports,
      it would do the same thing: iterate through the ports of this switch
      tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e8eb9a1
  17. 23 7月, 2021 2 次提交
    • V
      net: dsa: add support for bridge TX forwarding offload · 123abc06
      Vladimir Oltean 提交于
      For a DSA switch, to offload the forwarding process of a bridge device
      means to send the packets coming from the software bridge as data plane
      packets. This is contrary to everything that DSA has done so far,
      because the current taggers only know to send control packets (ones that
      target a specific destination port), whereas data plane packets are
      supposed to be forwarded according to the FDB lookup, much like packets
      ingressing on any regular ingress port. If the FDB lookup process
      returns multiple destination ports (flooding, multicast), then
      replication is also handled by the switch hardware - the bridge only
      sends a single packet and avoids the skb_clone().
      
      DSA keeps for each bridge port a zero-based index (the number of the
      bridge). Multiple ports performing TX forwarding offload to the same
      bridge have the same dp->bridge_num value, and ports not offloading the
      TX data plane of a bridge have dp->bridge_num = -1.
      
      The tagger can check if the packet that is being transmitted on has
      skb->offload_fwd_mark = true or not. If it does, it can be sure that the
      packet belongs to the data plane of a bridge, further information about
      which can be obtained based on dp->bridge_dev and dp->bridge_num.
      It can then compose a DSA tag for injecting a data plane packet into
      that bridge number.
      
      For the switch driver side, we offer two new dsa_switch_ops methods,
      called .port_bridge_fwd_offload_{add,del}, which are modeled after
      .port_bridge_{join,leave}.
      These methods are provided in case the driver needs to configure the
      hardware to treat packets coming from that bridge software interface as
      data plane packets. The switchdev <-> bridge interaction happens during
      the netdev_master_upper_dev_link() call, so to switch drivers, the
      effect is that the .port_bridge_fwd_offload_add() method is called
      immediately after .port_bridge_join().
      
      If the bridge number exceeds the number of bridges for which the switch
      driver can offload the TX data plane (and this includes the case where
      the driver can offload none), DSA falls back to simply returning
      tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      123abc06
    • V
      net: dsa: track the number of switches in a tree · 5b22d366
      Vladimir Oltean 提交于
      In preparation of supporting data plane forwarding on behalf of a
      software bridge, some drivers might need to view bridges as virtual
      switches behind the CPU port in a cross-chip topology.
      
      Give them some help and let them know how many physical switches there
      are in the tree, so that they can count the virtual switches starting
      from that number on.
      
      Note that the first dsa_switch_ops method where this information is
      reliably available is .setup(). This is because of how DSA works:
      in a tree with 3 switches, each calling dsa_register_switch(), the first
      2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
      and exit with error code 0 because the topology is not complete. Since
      probing is parallel at this point, one switch does not know about the
      existence of the other. Then the third switch comes, and for it,
      dsa_tree_setup_routing_table() returns complete = true. This switch goes
      ahead and calls dsa_tree_setup_switches() for everybody else, calling
      their .setup() methods too. This acts as the synchronization point.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b22d366
  18. 30 6月, 2021 2 次提交
    • V
      net: dsa: reference count the FDB addresses at the cross-chip notifier level · 3f6e32f9
      Vladimir Oltean 提交于
      The same concerns expressed for host MDB entries are valid for host FDBs
      just as well:
      
      - in the case of multiple bridges spanning the same switch chip, deleting
        a host FDB entry that belongs to one bridge will result in breakage to
        the other bridge
      - not deleting FDB entries across DSA links means that the switch's
        hardware tables will eventually run out, given enough wear&tear
      
      So do the same thing and introduce reference counting for CPU ports and
      DSA links using the same data structures as we have for MDB entries.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f6e32f9
    • V
      net: dsa: reference count the MDB entries at the cross-chip notifier level · 161ca59d
      Vladimir Oltean 提交于
      Ever since the cross-chip notifiers were introduced, the design was
      meant to be simplistic and just get the job done without worrying too
      much about dangling resources left behind.
      
      For example, somebody installs an MDB entry on sw0p0 in this daisy chain
      topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
      sw1p4 and sw2p4.
      
                                                          |
                 sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
              [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
              [   x   ] [       ] [       ] [       ] [       ]
                                                |
                                                +---------+
                                                          |
                 sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
              [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
              [       ] [       ] [       ] [       ] [   x   ]
                                                |
                                                +---------+
                                                          |
                 sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
              [  user ] [  user ] [  user ] [  user ] [  dsa  ]
              [       ] [       ] [       ] [       ] [   x   ]
      
      Then the same person deletes that MDB entry. The cross-chip notifier for
      deletion only matches sw0p0:
      
                                                          |
                 sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
              [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
              [   x   ] [       ] [       ] [       ] [       ]
                                                |
                                                +---------+
                                                          |
                 sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
              [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
              [       ] [       ] [       ] [       ] [       ]
                                                |
                                                +---------+
                                                          |
                 sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
              [  user ] [  user ] [  user ] [  user ] [  dsa  ]
              [       ] [       ] [       ] [       ] [       ]
      
      Why?
      
      Because the DSA links are 'trunk' ports, if we just go ahead and delete
      the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
      entries when they are still needed. Just consider the fact that somebody
      does:
      
      - add a multicast MAC address towards sw0p0 [ via the cross-chip
        notifiers it gets installed on the DSA links too ]
      - add the same multicast MAC address towards sw0p1 (another port of that
        same switch)
      - delete the same multicast MAC address from sw0p0.
      
      At this point, if we deleted the MAC address from the DSA links, it
      would be flooded, even though there is still an entry on switch 0 which
      needs it not to.
      
      So that is why deletions only match the targeted source port and nothing
      on DSA links. Of course, dangling resources means that the hardware
      tables will eventually run out given enough additions/removals, but hey,
      at least it's simple.
      
      But there is a bigger concern which needs to be addressed, and that is
      our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
      object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
      on the upstream port, and a similar thing happens on deletion:
      dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
      upstream port.
      
      When there are 2 VLAN-unaware bridges spanning the same switch (which is
      a use case DSA proudly supports), each bridge will install its own
      SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
      emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
      user ports enslaved to br0 and the user ports enslaved to br1. Not good.
      The host-trapped multicast addresses installed by br1 will be deleted
      when any state changes in br0 (IGMP timers expire, or ports leave, etc).
      
      To avoid this, we could of course go the route of the zero-sum game and
      delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
      design is to just admit that on shared ports like DSA links and CPU
      ports, we should be reference counting calls, even if this consumes some
      dynamic memory which DSA has traditionally avoided. On the flip side,
      the hardware tables of switches are limited in size, so it would be good
      if the OS managed them properly instead of having them eventually
      overflow.
      
      To address the memory usage concern, we only apply the refcounting of
      MDB entries on ports that are really shared (CPU ports and DSA links)
      and not on user ports. In a typical single-switch setup, this means only
      the CPU port (and the host MDB entries are not that many, really).
      
      The name of the newly introduced data structures (dsa_mac_addr) is
      chosen in such a way that will be reusable for host FDB entries (next
      patch).
      
      With this change, we can finally have the same matching logic for the
      MDB additions and deletions, as well as for their host-trapped variants.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      161ca59d
  19. 22 6月, 2021 2 次提交
  20. 21 4月, 2021 1 次提交
  21. 14 4月, 2021 1 次提交
    • M
      of: net: pass the dst buffer to of_get_mac_address() · 83216e39
      Michael Walle 提交于
      of_get_mac_address() returns a "const void*" pointer to a MAC address.
      Lately, support to fetch the MAC address by an NVMEM provider was added.
      But this will only work with platform devices. It will not work with
      PCI devices (e.g. of an integrated root complex) and esp. not with DSA
      ports.
      
      There is an of_* variant of the nvmem binding which works without
      devices. The returned data of a nvmem_cell_read() has to be freed after
      use. On the other hand the return of_get_mac_address() points to some
      static data without a lifetime. The trick for now, was to allocate a
      device resource managed buffer which is then returned. This will only
      work if we have an actual device.
      
      Change it, so that the caller of of_get_mac_address() has to supply a
      buffer where the MAC address is written to. Unfortunately, this will
      touch all drivers which use the of_get_mac_address().
      
      Usually the code looks like:
      
        const char *addr;
        addr = of_get_mac_address(np);
        if (!IS_ERR(addr))
          ether_addr_copy(ndev->dev_addr, addr);
      
      This can then be simply rewritten as:
      
        of_get_mac_address(np, ndev->dev_addr);
      
      Sometimes is_valid_ether_addr() is used to test the MAC address.
      of_get_mac_address() already makes sure, it just returns a valid MAC
      address. Thus we can just test its return code. But we have to be
      careful if there are still other sources for the MAC address before the
      of_get_mac_address(). In this case we have to keep the
      is_valid_ether_addr() call.
      
      The following coccinelle patch was used to convert common cases to the
      new style. Afterwards, I've manually gone over the drivers and fixed the
      return code variable: either used a new one or if one was already
      available use that. Mansour Moufid, thanks for that coccinelle patch!
      
      <spml>
      @a@
      identifier x;
      expression y, z;
      @@
      - x = of_get_mac_address(y);
      + x = of_get_mac_address(y, z);
        <...
      - ether_addr_copy(z, x);
        ...>
      
      @@
      identifier a.x;
      @@
      - if (<+... x ...+>) {}
      
      @@
      identifier a.x;
      @@
        if (<+... x ...+>) {
            ...
        }
      - else {}
      
      @@
      identifier a.x;
      expression e;
      @@
      - if (<+... x ...+>@e)
      -     {}
      - else
      + if (!(e))
            {...}
      
      @@
      expression x, y, z;
      @@
      - x = of_get_mac_address(y, z);
      + of_get_mac_address(y, z);
        ... when != x
      </spml>
      
      All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
      compile-time tested.
      Suggested-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NMichael Walle <michael@walle.cc>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83216e39
  22. 30 3月, 2021 1 次提交
  23. 23 3月, 2021 1 次提交
  24. 05 2月, 2021 1 次提交
  25. 30 1月, 2021 3 次提交
    • V
      net: dsa: allow changing the tag protocol via the "tagging" device attribute · 53da0eba
      Vladimir Oltean 提交于
      Currently DSA exposes the following sysfs:
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      
      which is a read-only device attribute, introduced in the kernel as
      commit 98cdb480 ("net: dsa: Expose tagging protocol to user-space"),
      and used by libpcap since its commit 993db3800d7d ("Add support for DSA
      link-layer types").
      
      It would be nice if we could extend this device attribute by making it
      writable:
      $ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
      
      This is useful with DSA switches that can make use of more than one
      tagging protocol. It may be useful in dsa_loop in the future too, to
      perform offline testing of various taggers, or for changing between dsa
      and edsa on Marvell switches, if that is desirable.
      
      In terms of implementation, drivers can support this feature by
      implementing .change_tag_protocol, which should always leave the switch
      in a consistent state: either with the new protocol if things went well,
      or with the old one if something failed. Teardown of the old protocol,
      if necessary, must be handled by the driver.
      
      Some things remain as before:
      - The .get_tag_protocol is currently only called at probe time, to load
        the initial tagging protocol driver. Nonetheless, new drivers should
        report the tagging protocol in current use now.
      - The driver should manage by itself the initial setup of tagging
        protocol, no later than the .setup() method, as well as destroying
        resources used by the last tagger in use, no earlier than the
        .teardown() method.
      
      For multi-switch DSA trees, error handling is a bit more complicated,
      since e.g. the 5th out of 7 switches may fail to change the tag
      protocol. When that happens, a revert to the original tag protocol is
      attempted, but that may fail too, leaving the tree in an inconsistent
      state despite each individual switch implementing .change_tag_protocol
      transactionally. Since the intersection between drivers that implement
      .change_tag_protocol and drivers that support D in DSA is currently the
      empty set, the possibility for this error to happen is ignored for now.
      
      Testing:
      
      $ insmod mscc_felix.ko
      [   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
      [   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
      $ insmod tag_ocelot.ko
      $ rmmod mscc_felix.ko
      $ insmod mscc_felix.ko
      [   97.261724] libphy: VSC9959 internal MDIO bus: probed
      [   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
      [   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
      [   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
      [   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
      [   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
      [   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
      [   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
      [   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
      [   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
      [   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
      [   98.527948] device eno2 entered promiscuous mode
      [   98.544755] DSA: tree 0 setup
      
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
      ^C
       -  10.0.0.1 ping statistics  -
      2 packets transmitted, 2 packets received, 0% packet loss
      round-trip min/avg/max = 0.754/1.545/2.337 ms
      
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      $ cat ./test_ocelot_8021q.sh
              #!/bin/bash
      
              ip link set swp0 down
              ip link set swp1 down
              ip link set swp2 down
              ip link set swp3 down
              ip link set swp5 down
              ip link set eno2 down
              echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
              ip link set eno2 up
              ip link set swp0 up
              ip link set swp1 up
              ip link set swp2 up
              ip link set swp3 up
              ip link set swp5 up
      $ ./test_ocelot_8021q.sh
      ./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
      $ rmmod tag_ocelot.ko
      rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
      $ insmod tag_ocelot_8021q.ko
      $ ./test_ocelot_8021q.sh
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot-8021q
      $ rmmod tag_ocelot.ko
      $ rmmod tag_ocelot_8021q.ko
      rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
      64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
      $ rmmod mscc_felix.ko
      [  645.544426] mscc_felix 0000:00:00.5: Link is Down
      [  645.838608] DSA: tree 0 torn down
      $ rmmod tag_ocelot_8021q.ko
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      53da0eba
    • V
      net: dsa: keep a copy of the tagging protocol in the DSA switch tree · 357f203b
      Vladimir Oltean 提交于
      Cascading DSA switches can be done multiple ways. There is the brute
      force approach / tag stacking, where one upstream switch, located
      between leaf switches and the host Ethernet controller, will just
      happily transport the DSA header of those leaf switches as payload.
      For this kind of setups, DSA works without any special kind of treatment
      compared to a single switch - they just aren't aware of each other.
      Then there's the approach where the upstream switch understands the tags
      it transports from its leaves below, as it doesn't push a tag of its own,
      but it routes based on the source port & switch id information present
      in that tag (as opposed to DMAC & VID) and it strips the tag when
      egressing a front-facing port. Currently only Marvell implements the
      latter, and Marvell DSA trees contain only Marvell switches.
      
      So it is safe to say that DSA trees already have a single tag protocol
      shared by all switches, and in fact this is what makes the switches able
      to understand each other. This fact is also implied by the fact that
      currently, the tagging protocol is reported as part of a sysfs installed
      on the DSA master and not per port, so it must be the same for all the
      ports connected to that DSA master regardless of the switch that they
      belong to.
      
      It's time to make this official and enforce it (yes, this also means we
      won't have any "switch understands tag to some extent but is not able to
      speak it" hardware oddities that we'll support in the future).
      
      This is needed due to the imminent introduction of the dsa_switch_ops::
      change_tag_protocol driver API. When that is introduced, we'll have
      to notify switches of the tagging protocol that they're configured to
      use. Currently the tag_ops structure pointer is held only for CPU ports.
      But there are switches which don't have CPU ports and nonetheless still
      need to be configured. These would be Marvell leaf switches whose
      upstream port is just a DSA link. How do we inform these of their
      tagging protocol setup/deletion?
      
      One answer to the above would be: iterate through the DSA switch tree's
      ports once, list the CPU ports, get their tag_ops, then iterate again
      now that we have it, and notify everybody of that tag_ops. But what to
      do if conflicts appear between one cpu_dp->tag_ops and another? There's
      no escaping the fact that conflict resolution needs to be done, so we
      can be upfront about it.
      
      Ease our work and just keep the master copy of the tag_ops inside the
      struct dsa_switch_tree. Reference counting is now moved to be per-tree
      too, instead of per-CPU port.
      
      There are many places in the data path that access master->dsa_ptr->tag_ops
      and we would introduce unnecessary performance penalty going through yet
      another indirection, so keep those right where they are.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      357f203b
    • V
      net: dsa: document the existing switch tree notifiers and add a new one · 886f8e26
      Vladimir Oltean 提交于
      The existence of dsa_broadcast has generated some confusion in the past:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html
      
      So let's document the existing dsa_port_notify and dsa_broadcast
      functions and explain when each of them should be used.
      
      Also, in fact, the in-between function has always been there but was
      lacking a name, and is the main reason for this patch: dsa_tree_notify.
      Refactor dsa_broadcast to use it.
      
      This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
      where it really belonged in the first place, but had no companion so it
      stood with dsa_port_notify.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      886f8e26
  26. 16 1月, 2021 2 次提交
    • V
      net: dsa: add ops for devlink-sb · 2a6ef763
      Vladimir Oltean 提交于
      Switches that care about QoS might have hardware support for reserving
      buffer pools for individual ports or traffic classes, and configuring
      their sizes and thresholds. Through devlink-sb (shared buffers), this is
      all configurable, as well as their occupancy being viewable.
      
      Add the plumbing in DSA for these operations.
      
      Individual drivers still need to call devlink_sb_register() with the
      shared buffers they want to expose. A helper was not created in DSA for
      this purpose (unlike, say, dsa_devlink_params_register), since in my
      opinion it does not bring any benefit over plainly calling
      devlink_sb_register() directly.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      2a6ef763
    • V
      net: dsa: set configure_vlan_while_not_filtering to true by default · 0ee2af4e
      Vladimir Oltean 提交于
      As explained in commit 54a0ed0d ("net: dsa: provide an option for
      drivers to always receive bridge VLANs"), DSA has historically been
      skipping VLAN switchdev operations when the bridge wasn't in
      vlan_filtering mode, but the reason why it was doing that has never been
      clear. So the configure_vlan_while_not_filtering option is there merely
      to preserve functionality for existing drivers. It isn't some behavior
      that drivers should opt into. Ideally, when all drivers leave this flag
      set, we can delete the dsa_port_skip_vlan_configuration() function.
      
      New drivers always seem to omit setting this flag, for some reason. So
      let's reverse the logic: the DSA core sets it by default to true before
      the .setup() callback, and legacy drivers can turn it off. This way, new
      drivers get the new behavior by default, unless they explicitly set the
      flag to false, which is more obvious during review.
      
      Remove the assignment from drivers which were setting it to true, and
      add the assignment to false for the drivers that didn't previously have
      it. This way, it should be easier to see how many we have left.
      
      The following drivers: lan9303, mv88e6060 were skipped from setting this
      flag to false, because they didn't have any VLAN offload ops in the
      first place.
      
      The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and
      therefore also inherits the configure_vlan_while_not_filtering=true
      behavior.
      
      Also, print a message through netlink extack every time a VLAN has been
      skipped. This is mildly annoying on purpose, so that (a) it is at least
      clear that VLANs are being skipped - the legacy behavior in itself is
      confusing, and the extack should be much more difficult to miss, unlike
      kernel logs - and (b) people have one more incentive to convert to the
      new behavior.
      
      No behavior change except for the added prints is intended at this time.
      
      $ ip link add br0 type bridge vlan_filtering 0
      $ ip link set sw0p2 master br0
      [   60.315148] br0: port 1(sw0p2) entered blocking state
      [   60.320350] br0: port 1(sw0p2) entered disabled state
      [   60.327839] device sw0p2 entered promiscuous mode
      [   60.334905] br0: port 1(sw0p2) entered blocking state
      [   60.340142] br0: port 1(sw0p2) entered forwarding state
      Warning: dsa_core: skipping configuration of VLAN. # This was the pvid
      $ bridge vlan add dev sw0p2 vid 100
      Warning: dsa_core: skipping configuration of VLAN.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NKurt Kanzenbach <kurt@linutronix.de>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      0ee2af4e
  27. 15 1月, 2021 1 次提交
    • T
      net: dsa: Link aggregation support · 058102a6
      Tobias Waldekranz 提交于
      Monitor the following events and notify the driver when:
      
      - A DSA port joins/leaves a LAG.
      - A LAG, made up of DSA ports, joins/leaves a bridge.
      - A DSA port in a LAG is enabled/disabled (enabled meaning
        "distributing" in 802.3ad LACP terms).
      
      When a LAG joins a bridge, the DSA subsystem will treat that as each
      individual port joining the bridge. The driver may look at the port's
      LAG device pointer to see if it is associated with any LAG, if that is
      required. This is analogue to how switchdev events are replicated out
      to all lower devices when reaching e.g. a LAG.
      
      Drivers can optionally request that DSA maintain a linear mapping from
      a LAG ID to the corresponding netdev by setting ds->num_lag_ids to the
      desired size.
      
      In the event that the hardware is not capable of offloading a
      particular LAG for any reason (the typical case being use of exotic
      modes like broadcast), DSA will take a hands-off approach, allowing
      the LAG to be formed as a pure software construct. This is reported
      back through the extended ACK, but is otherwise transparent to the
      user.
      Signed-off-by: NTobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
      Tested-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      058102a6