1. 25 2月, 2022 2 次提交
  2. 16 2月, 2022 2 次提交
    • V
      net: dsa: offload bridge port VLANs on foreign interfaces · 164f861b
      Vladimir Oltean 提交于
      DSA now explicitly handles VLANs installed with the 'self' flag on the
      bridge as host VLANs, instead of just replicating every bridge port VLAN
      also on the CPU port and never deleting it, which is what it did before.
      
      However, this leaves a corner case uncovered, as explained by
      Tobias Waldekranz:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260
      
      Forwarding towards a bridge port VLAN installed on a bridge port foreign
      to DSA (separate NIC, Wi-Fi AP) used to work by virtue of the fact that
      DSA itself needed to have at least one port in that VLAN (therefore, it
      also had the CPU port in said VLAN). However, now that the CPU port may
      not be member of all VLANs that user ports are members of, we need to
      ensure this isn't the case if software forwarding to a foreign interface
      is required.
      
      The solution is to treat bridge port VLANs on standalone interfaces in
      the exact same way as host VLANs. From DSA's perspective, there is no
      difference between local termination and software forwarding; packets in
      that VLAN must reach the CPU in both cases.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      164f861b
    • V
      net: dsa: add explicit support for host bridge VLANs · 134ef238
      Vladimir Oltean 提交于
      Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
      does so on user ports. This is good for basic functionality but has
      several limitations:
      
      - the VLAN group which must reach the CPU may be radically different
        from the VLAN group that must be autonomously forwarded by the switch.
        In other words, the admin may want to isolate noisy stations and avoid
        traffic from them going to the control processor of the switch, where
        it would just waste useless cycles. The bridge already supports
        independent control of VLAN groups on bridge ports and on the bridge
        itself, and when VLAN-aware, it will drop packets in software anyway
        if their VID isn't added as a 'self' entry towards the bridge device.
      
      - Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
        on replaying the host VLANs as well. The 2 VLAN groups are
        approximately the same in most regular cases, but there are corner
        cases when timing matters, and DSA's approximation of replicating
        VLANs on shared ports simply does not work.
      
      - If a user makes the bridge (implicitly the CPU port) join a VLAN by
        accident, there is no way for the CPU port to isolate itself from that
        noisy VLAN except by rebooting the system. This is because for each
        VLAN added on a user port, DSA will add it on shared ports too, but
        for each VLAN deletion on a user port, it will remain installed on
        shared ports, since DSA has no good indication of whether the VLAN is
        still in use or not.
      
      Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
      addition and removal events, DSA has a simple and straightforward task
      of separating the bridge port VLANs (these have an orig_dev which is a
      DSA slave interface, or a LAG interface) from the host VLANs (these have
      an orig_dev which is a bridge interface), and to keep a simple reference
      count of each VID on each shared port.
      
      Forwarding VLANs must be installed on the bridge ports and on all DSA
      ports interconnecting them. We don't have a good view of the exact
      topology, so we simply install forwarding VLANs on all DSA ports, which
      is what has been done until now.
      
      Host VLANs must be installed primarily on the dedicated CPU port of each
      bridge port. More subtly, they must also be installed on upstream-facing
      and downstream-facing DSA ports that are connecting the bridge ports and
      the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
      entry may be absent from VTU) is still addressed even if that switch is
      in a cross-chip setup, and it has no local CPU port.
      
      Therefore:
      - user ports contain only bridge port (forwarding) VLANs, and no
        refcounting is necessary
      - DSA ports contain both forwarding and host VLANs. Refcounting is
        necessary among these 2 types.
      - CPU ports contain only host VLANs. Refcounting is also necessary.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      134ef238
  3. 09 2月, 2022 1 次提交
    • V
      net: dsa: fix panic when DSA master device unbinds on shutdown · ee534378
      Vladimir Oltean 提交于
      Rafael reports that on a system with LX2160A and Marvell DSA switches,
      if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
      panic can be seen:
      
      systemd-shutdown[1]: Rebooting.
      Unable to handle kernel paging request at virtual address 00a0000800000041
      [00a0000800000041] address between user and kernel address ranges
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
      pc : dsa_slave_netdevice_event+0x130/0x3e4
      lr : raw_notifier_call_chain+0x50/0x6c
      Call trace:
       dsa_slave_netdevice_event+0x130/0x3e4
       raw_notifier_call_chain+0x50/0x6c
       call_netdevice_notifiers_info+0x54/0xa0
       __dev_close_many+0x50/0x130
       dev_close_many+0x84/0x120
       unregister_netdevice_many+0x130/0x710
       unregister_netdevice_queue+0x8c/0xd0
       unregister_netdev+0x20/0x30
       dpaa2_eth_remove+0x68/0x190
       fsl_mc_driver_remove+0x20/0x5c
       __device_release_driver+0x21c/0x220
       device_release_driver_internal+0xac/0xb0
       device_links_unbind_consumers+0xd4/0x100
       __device_release_driver+0x94/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_device_remove+0x24/0x40
       __fsl_mc_device_remove+0xc/0x20
       device_for_each_child+0x58/0xa0
       dprc_remove+0x90/0xb0
       fsl_mc_driver_remove+0x20/0x5c
       __device_release_driver+0x21c/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_bus_remove+0x80/0x100
       fsl_mc_bus_shutdown+0xc/0x1c
       platform_shutdown+0x20/0x30
       device_shutdown+0x154/0x330
       __do_sys_reboot+0x1cc/0x250
       __arm64_sys_reboot+0x20/0x30
       invoke_syscall.constprop.0+0x4c/0xe0
       do_el0_svc+0x4c/0x150
       el0_svc+0x24/0xb0
       el0t_64_sync_handler+0xa8/0xb0
       el0t_64_sync+0x178/0x17c
      
      It can be seen from the stack trace that the problem is that the
      deregistration of the master causes a dev_close(), which gets notified
      as NETDEV_GOING_DOWN to dsa_slave_netdevice_event().
      But dsa_switch_shutdown() has already run, and this has unregistered the
      DSA slave interfaces, and yet, the NETDEV_GOING_DOWN handler attempts to
      call dev_close_many() on those slave interfaces, leading to the problem.
      
      The previous attempt to avoid the NETDEV_GOING_DOWN on the master after
      dsa_switch_shutdown() was called seems improper. Unregistering the slave
      interfaces is unnecessary and unhelpful. Instead, after the slaves have
      stopped being uppers of the DSA master, we can now reset to NULL the
      master->dsa_ptr pointer, which will make DSA start ignoring all future
      notifier events on the master.
      
      Fixes: 0650bf52 ("net: dsa: be compatible with masters which unregister on shutdown")
      Reported-by: NRafael Richter <rafael.richter@gin.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee534378
  4. 02 2月, 2022 2 次提交
    • V
      net: dsa: replay master state events in dsa_tree_{setup,teardown}_master · e83d5653
      Vladimir Oltean 提交于
      In order for switch driver to be able to make simple and reliable use of
      the master tracking operations, they must also be notified of the
      initial state of the DSA master, not just of the changes. This is
      because they might enable certain features only during the time when
      they know that the DSA master is up and running.
      
      Therefore, this change explicitly checks the state of the DSA master
      under the same rtnl_mutex as we were holding during the
      dsa_master_setup() and dsa_master_teardown() call. The idea being that
      if the DSA master became operational in between the moment in which it
      became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
      when we checked for the master being up, there is a chance that we
      would emit a ->master_state_change() call with no actual state change.
      We need to avoid that by serializing the concurrent netdevice event with
      us. If the netdevice event started before, we force it to finish before
      we begin, because we take rtnl_lock before making netdev_uses_dsa()
      return true. So we also handle that early event and do nothing on it.
      Similarly, if the dev_open() attempt is concurrent with us, it will
      attempt to take the rtnl_mutex, but we're holding it. We'll see that
      the master flag IFF_UP isn't set, then when we release the rtnl_mutex
      we'll process the NETDEV_UP notifier.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NAnsuel Smith <ansuelsmth@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e83d5653
    • V
      net: dsa: provide switch operations for tracking the master state · 295ab96f
      Vladimir Oltean 提交于
      Certain drivers may need to send management traffic to the switch for
      things like register access, FDB dump, etc, to accelerate what their
      slow bus (SPI, I2C, MDIO) can already do.
      
      Ethernet is faster (especially in bulk transactions) but is also more
      unreliable, since the user may decide to bring the DSA master down (or
      not bring it up), therefore severing the link between the host and the
      attached switch.
      
      Drivers needing Ethernet-based register access already should have
      fallback logic to the slow bus if the Ethernet method fails, but that
      fallback may be based on a timeout, and the I/O to the switch may slow
      down to a halt if the master is down, because every Ethernet packet will
      have to time out. The driver also doesn't have the option to turn off
      Ethernet-based I/O momentarily, because it wouldn't know when to turn it
      back on.
      
      Which is where this change comes in. By tracking NETDEV_CHANGE,
      NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
      the exact interval of time during which this interface is reliably
      available for traffic. Provide this information to switches so they can
      use it as they wish.
      
      An helper is added dsa_port_master_is_operational() to check if a master
      port is operational.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NAnsuel Smith <ansuelsmth@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      295ab96f
  5. 06 1月, 2022 3 次提交
    • V
      net: dsa: setup master before ports · 11fd667d
      Vladimir Oltean 提交于
      It is said that as soon as a network interface is registered, all its
      resources should have already been prepared, so that it is available for
      sending and receiving traffic. One of the resources needed by a DSA
      slave interface is the master.
      
      dsa_tree_setup
      -> dsa_tree_setup_ports
         -> dsa_port_setup
            -> dsa_slave_create
               -> register_netdevice
      -> dsa_tree_setup_master
         -> dsa_master_setup
            -> sets up master->dsa_ptr, which enables reception
      
      Therefore, there is a short period of time after register_netdevice()
      during which the master isn't prepared to pass traffic to the DSA layer
      (master->dsa_ptr is checked by eth_type_trans). Same thing during
      unregistration, there is a time frame in which packets might be missed.
      
      Note that this change opens us to another race: dsa_master_find_slave()
      will get invoked potentially earlier than the slave creation, and later
      than the slave deletion. Since dp->slave starts off as a NULL pointer,
      the earlier calls aren't a problem, but the later calls are. To avoid
      use-after-free, we should zeroize dp->slave before calling
      dsa_slave_destroy().
      
      In practice I cannot really test real life improvements brought by this
      change, since in my systems, netdevice creation races with PHY autoneg
      which takes a few seconds to complete, and that masks quite a few races.
      Effects might be noticeable in a setup with fixed links all the way to
      an external system.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11fd667d
    • V
      net: dsa: first set up shared ports, then non-shared ports · 1e3f407f
      Vladimir Oltean 提交于
      After commit a57d8c21 ("net: dsa: flush switchdev workqueue before
      tearing down CPU/DSA ports"), the port setup and teardown procedure
      became asymmetric.
      
      The fact of the matter is that user ports need the shared ports to be up
      before they can be used for CPU-initiated termination. And since we
      register net devices for the user ports, those won't be functional until
      we also call the setup for the shared (CPU, DSA) ports. But we may do
      that later, depending on the port numbering scheme of the hardware we
      are dealing with.
      
      It just makes sense that all shared ports are brought up before any user
      port is. I can't pinpoint any issue due to the current behavior, but
      let's change it nonetheless, for consistency's sake.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e3f407f
    • V
      net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown} · c146f9bc
      Vladimir Oltean 提交于
      DSA needs to simulate master tracking events when a binding is first
      with a DSA master established and torn down, in order to give drivers
      the simplifying guarantee that ->master_state_change calls are made
      only when the master's readiness state to pass traffic changes.
      master_state_change() provide a operational bool that DSA driver can use
      to understand if DSA master is operational or not.
      To avoid races, we need to block the reception of
      NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier
      chain while we are changing the master's dev->dsa_ptr (this changes what
      netdev_uses_dsa(dev) reports).
      
      The dsa_master_setup() and dsa_master_teardown() functions optionally
      require the rtnl_mutex to be held, if the tagger needs the master to be
      promiscuous, these functions call dev_set_promiscuity(). Move the
      rtnl_lock() from that function and make it top-level.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c146f9bc
  6. 05 1月, 2022 1 次提交
    • V
      net: dsa: make dsa_switch :: num_ports an unsigned int · 258030ac
      Vladimir Oltean 提交于
      Currently, num_ports is declared as size_t, which is defined as
      __kernel_ulong_t, therefore it occupies 8 bytes of memory.
      
      Even switches with port numbers in the range of tens are exotic, so
      there is no need for this amount of storage.
      
      Additionally, because the max_num_bridges member right above it is also
      4 bytes, it means the compiler needs to add padding between the last 2
      fields. By reducing the size, we don't need that padding and can reduce
      the struct size.
      
      Before:
      
      pahole -C dsa_switch net/dsa/slave.o
      struct dsa_switch {
              struct device *            dev;                  /*     0     8 */
              struct dsa_switch_tree *   dst;                  /*     8     8 */
              unsigned int               index;                /*    16     4 */
              u32                        setup:1;              /*    20: 0  4 */
              u32                        vlan_filtering_is_global:1; /*    20: 1  4 */
              u32                        needs_standalone_vlan_filtering:1; /*    20: 2  4 */
              u32                        configure_vlan_while_not_filtering:1; /*    20: 3  4 */
              u32                        untag_bridge_pvid:1;  /*    20: 4  4 */
              u32                        assisted_learning_on_cpu_port:1; /*    20: 5  4 */
              u32                        vlan_filtering:1;     /*    20: 6  4 */
              u32                        pcs_poll:1;           /*    20: 7  4 */
              u32                        mtu_enforcement_ingress:1; /*    20: 8  4 */
      
              /* XXX 23 bits hole, try to pack */
      
              struct notifier_block      nb;                   /*    24    24 */
      
              /* XXX last struct has 4 bytes of padding */
      
              void *                     priv;                 /*    48     8 */
              void *                     tagger_data;          /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct dsa_chip_data *     cd;                   /*    64     8 */
              const struct dsa_switch_ops  * ops;              /*    72     8 */
              u32                        phys_mii_mask;        /*    80     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              struct mii_bus *           slave_mii_bus;        /*    88     8 */
              unsigned int               ageing_time_min;      /*    96     4 */
              unsigned int               ageing_time_max;      /*   100     4 */
              struct dsa_8021q_context * tag_8021q_ctx;        /*   104     8 */
              struct devlink *           devlink;              /*   112     8 */
              unsigned int               num_tx_queues;        /*   120     4 */
              unsigned int               num_lag_ids;          /*   124     4 */
              /* --- cacheline 2 boundary (128 bytes) --- */
              unsigned int               max_num_bridges;      /*   128     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              size_t                     num_ports;            /*   136     8 */
      
              /* size: 144, cachelines: 3, members: 27 */
              /* sum members: 132, holes: 2, sum holes: 8 */
              /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */
              /* paddings: 1, sum paddings: 4 */
              /* last cacheline: 16 bytes */
      };
      
      After:
      
      pahole -C dsa_switch net/dsa/slave.o
      struct dsa_switch {
              struct device *            dev;                  /*     0     8 */
              struct dsa_switch_tree *   dst;                  /*     8     8 */
              unsigned int               index;                /*    16     4 */
              u32                        setup:1;              /*    20: 0  4 */
              u32                        vlan_filtering_is_global:1; /*    20: 1  4 */
              u32                        needs_standalone_vlan_filtering:1; /*    20: 2  4 */
              u32                        configure_vlan_while_not_filtering:1; /*    20: 3  4 */
              u32                        untag_bridge_pvid:1;  /*    20: 4  4 */
              u32                        assisted_learning_on_cpu_port:1; /*    20: 5  4 */
              u32                        vlan_filtering:1;     /*    20: 6  4 */
              u32                        pcs_poll:1;           /*    20: 7  4 */
              u32                        mtu_enforcement_ingress:1; /*    20: 8  4 */
      
              /* XXX 23 bits hole, try to pack */
      
              struct notifier_block      nb;                   /*    24    24 */
      
              /* XXX last struct has 4 bytes of padding */
      
              void *                     priv;                 /*    48     8 */
              void *                     tagger_data;          /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct dsa_chip_data *     cd;                   /*    64     8 */
              const struct dsa_switch_ops  * ops;              /*    72     8 */
              u32                        phys_mii_mask;        /*    80     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              struct mii_bus *           slave_mii_bus;        /*    88     8 */
              unsigned int               ageing_time_min;      /*    96     4 */
              unsigned int               ageing_time_max;      /*   100     4 */
              struct dsa_8021q_context * tag_8021q_ctx;        /*   104     8 */
              struct devlink *           devlink;              /*   112     8 */
              unsigned int               num_tx_queues;        /*   120     4 */
              unsigned int               num_lag_ids;          /*   124     4 */
              /* --- cacheline 2 boundary (128 bytes) --- */
              unsigned int               max_num_bridges;      /*   128     4 */
              unsigned int               num_ports;            /*   132     4 */
      
              /* size: 136, cachelines: 3, members: 27 */
              /* sum members: 128, holes: 1, sum holes: 4 */
              /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */
              /* paddings: 1, sum paddings: 4 */
              /* last cacheline: 8 bytes */
      };
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      258030ac
  7. 14 12月, 2021 1 次提交
    • V
      net: dsa: make tagging protocols connect to individual switches from a tree · 7f297314
      Vladimir Oltean 提交于
      On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
      the mechanism through which the tagger connects to the switch tree is
      broken, due to improper DSA code design. At the time when tag_ops->connect()
      is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
      the ports, so it doesn't know how large the tree is and how many ports
      it has. It has just seen the first CPU port by this time. As a result,
      this function will call the tagger's ->connect method too early, and the
      tagger will connect only to the first switch from the tree.
      
      This could be perhaps addressed a bit more simply by just moving the
      tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
      but there is already a design inconsistency at present: on the switch
      side, the notification is on a per-switch basis, but on the tagger side,
      it is on a per-tree basis. Furthermore, the persistent storage itself is
      per switch (ds->tagger_data). And the tagger connect and disconnect
      procedures (at least the ones that exist currently) could see a fair bit
      of simplification if they didn't have to iterate through the switches of
      a tree.
      
      To fix the issue, this change transforms tag_ops->connect(dst) into
      tag_ops->connect(ds) and moves it somewhere where we already iterate
      over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
      which is a good placement because we already have there the connection
      call to the switch side of things.
      
      As for the dsa_tree_bind_tag_proto() method (called from the code path
      that changes the tag protocol), things are a bit more complicated
      because we receive the tree as argument, yet when we unwind on errors,
      it would be nice to not call tag_ops->disconnect(ds) where we didn't
      previously call tag_ops->connect(ds). We didn't have this problem before
      because the tag_ops connection operations passed the entire dst before,
      and this is more fine grained now. To solve the error rewind case using
      the new API, we have to create yet one more cross-chip notifier for
      disconnection, and stay connected with the old tag protocol to all the
      switches in the tree until we've succeeded to connect with the new one
      as well. So if something fails half way, the whole tree is still
      connected to the old tagger. But there may still be leaks if the tagger
      fails to connect to the 2nd out of 3 switches in a tree: somebody needs
      to tell the tagger to disconnect from the first switch. Nothing comes
      for free, and this was previously handled privately by the tagging
      protocol driver before, but now we need to emit a disconnect cross-chip
      notifier for that, because DSA has to take care of the unwind path. We
      assume that the tagging protocol has connected to a switch if it has set
      ds->tagger_data to something, otherwise we avoid calling its
      disconnection method in the error rewind path.
      
      The rest of the changes are in the tagging protocol drivers, and have to
      do with the replacement of dst with ds. The iteration is removed and the
      error unwind path is simplified, as mentioned above.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f297314
  8. 12 12月, 2021 1 次提交
    • V
      net: dsa: introduce tagger-owned storage for private and shared data · dc452a47
      Vladimir Oltean 提交于
      Ansuel is working on register access over Ethernet for the qca8k switch
      family. This requires the qca8k tagging protocol driver to receive
      frames which aren't intended for the network stack, but instead for the
      qca8k switch driver itself.
      
      The dp->priv is currently the prevailing method for passing data back
      and forth between the tagging protocol driver and the switch driver.
      However, this method is riddled with caveats.
      
      The DSA design allows in principle for any switch driver to return any
      protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
      modified to do just that. But in the current design, the memory behind
      dp->priv has to be allocated by the switch driver, so if the tagging
      protocol is paired to an unexpected switch driver, we may end up in NULL
      pointer dereferences inside the kernel, or worse (a switch driver may
      allocate dp->priv according to the expectations of a different tagger).
      
      The latter possibility is even more plausible considering that DSA
      switches can dynamically change tagging protocols in certain cases
      (dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
      itself to mistakes that are all too easy to make.
      
      This patch proposes that the tagging protocol driver should manage its
      own memory, instead of relying on the switch driver to do so.
      After analyzing the different in-tree needs, it can be observed that the
      required tagger storage is per switch, therefore a ds->tagger_data
      pointer is introduced. In principle, per-port storage could also be
      introduced, although there is no need for it at the moment. Future
      changes will replace the current usage of dp->priv with ds->tagger_data.
      
      We define a "binding" event between the DSA switch tree and the tagging
      protocol. During this binding event, the tagging protocol's ->connect()
      method is called first, and this may allocate some memory for each
      switch of the tree. Then a cross-chip notifier is emitted for the
      switches within that tree, and they are given the opportunity to fix up
      the tagger's memory (for example, they might set up some function
      pointers that represent virtual methods for consuming packets).
      Because the memory is owned by the tagger, there exists a ->disconnect()
      method for the tagger (which is the place to free the resources), but
      there doesn't exist a ->disconnect() method for the switch driver.
      This is part of the design. The switch driver should make minimal use of
      the public part of the tagger data, and only after type-checking it
      using the supplied "proto" argument.
      
      In the code there are in fact two binding events, one is the initial
      event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
      notifier chains aren't initialized, so we call each switch's connect()
      method by hand. Then there is dsa_tree_bind_tag_proto() during
      dsa_tree_change_tag_proto(), and here we have an old protocol and a new
      one. We first connect to the new one before disconnecting from the old
      one, to simplify error handling a bit and to ensure we remain in a valid
      state at all times.
      Co-developed-by: NAnsuel Smith <ansuelsmth@gmail.com>
      Signed-off-by: NAnsuel Smith <ansuelsmth@gmail.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc452a47
  9. 09 12月, 2021 3 次提交
    • V
      net: dsa: keep the bridge_dev and bridge_num as part of the same structure · d3eed0e5
      Vladimir Oltean 提交于
      The main desire behind this is to provide coherent bridge information to
      the fast path without locking.
      
      For example, right now we set dp->bridge_dev and dp->bridge_num from
      separate code paths, it is theoretically possible for a packet
      transmission to read these two port properties consecutively and find a
      bridge number which does not correspond with the bridge device.
      
      Another desire is to start passing more complex bridge information to
      dsa_switch_ops functions. For example, with FDB isolation, it is
      expected that drivers will need to be passed the bridge which requested
      an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
      associated bridge_num should be passed too, in case the driver might
      want to implement an isolation scheme based on that number.
      
      We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
      offload switch API, however we'd like to remove that and squash it into
      the basic bridge join/leave API. So that means we need to pass this
      pair to the bridge join/leave API.
      
      During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
      call the driver's .port_bridge_leave with what used to be our
      dp->bridge_dev, but provided as an argument.
      
      When bridge_dev and bridge_num get folded into a single structure, we
      need to preserve this behavior in dsa_port_bridge_leave: we need a copy
      of what used to be in dp->bridge.
      
      Switch drivers check bridge membership by comparing dp->bridge_dev with
      the provided bridge_dev, but now, if we provide the struct dsa_bridge as
      a pointer, they cannot keep comparing dp->bridge to the provided
      pointer, since this only points to an on-stack copy. To make this
      obvious and prevent driver writers from forgetting and doing stupid
      things, in this new API, the struct dsa_bridge is provided as a full
      structure (not very large, contains an int and a pointer) instead of a
      pointer. An explicit comparison function needs to be used to determine
      bridge membership: dsa_port_offloads_bridge().
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAlvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      d3eed0e5
    • V
      net: dsa: assign a bridge number even without TX forwarding offload · 947c8746
      Vladimir Oltean 提交于
      The service where DSA assigns a unique bridge number for each forwarding
      domain is useful even for drivers which do not implement the TX
      forwarding offload feature.
      
      For example, drivers might use the dp->bridge_num for FDB isolation.
      
      So rename ds->num_fwd_offloading_bridges to ds->max_num_bridges, and
      calculate a unique bridge_num for all drivers that set this value.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAlvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      947c8746
    • V
      net: dsa: make dp->bridge_num one-based · 3f9bb030
      Vladimir Oltean 提交于
      I have seen too many bugs already due to the fact that we must encode an
      invalid dp->bridge_num as a negative value, because the natural tendency
      is to check that invalid value using (!dp->bridge_num). Latest example
      can be seen in commit 1bec0f05 ("net: dsa: fix bridge_num not
      getting cleared after ports leaving the bridge").
      
      Convert the existing users to assume that dp->bridge_num == 0 is the
      encoding for invalid, and valid bridge numbers start from 1.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAlvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      3f9bb030
  10. 25 10月, 2021 2 次提交
    • V
      net: dsa: introduce locking for the address lists on CPU and DSA ports · 338a3a47
      Vladimir Oltean 提交于
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      338a3a47
    • D
      Revert "Merge branch 'dsa-rtnl'" · 2d7e73f0
      David S. Miller 提交于
      This reverts commit 965e6b26, reversing
      changes made to 4d98bb0d.
      2d7e73f0
  11. 24 10月, 2021 1 次提交
    • V
      net: dsa: introduce locking for the address lists on CPU and DSA ports · d3bd8924
      Vladimir Oltean 提交于
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3bd8924
  12. 21 10月, 2021 2 次提交
  13. 20 10月, 2021 1 次提交
  14. 14 10月, 2021 1 次提交
  15. 09 10月, 2021 2 次提交
    • V
      net: dsa: hold rtnl_lock in dsa_switch_setup_tag_protocol · 1951b3f1
      Vladimir Oltean 提交于
      It was a documented fact that ds->ops->change_tag_protocol() offered
      rtnetlink mutex protection to the switch driver, since there was an
      ASSERT_RTNL right before the call in dsa_switch_change_tag_proto()
      (initiated from sysfs).
      
      The blamed commit introduced another call path for
      ds->ops->change_tag_protocol() which does not hold the rtnl_mutex.
      This is:
      
      dsa_tree_setup
      -> dsa_tree_setup_switches
         -> dsa_switch_setup
            -> dsa_switch_setup_tag_protocol
               -> ds->ops->change_tag_protocol()
         -> dsa_port_setup
            -> dsa_slave_create
               -> register_netdevice(slave_dev)
      -> dsa_tree_setup_master
         -> dsa_master_setup
            -> dev->dsa_ptr = cpu_dp
      
      The reason why the rtnl_mutex is held in the sysfs call path is to
      ensure that, once the master and all the DSA interfaces are down (which
      is required so that no packets flow), they remain down during the
      tagging protocol change.
      
      The above calling order illustrates the fact that it should not be risky
      to change the initial tagging protocol to the one specified in the
      device tree at the given time:
      
      - packets cannot enter the dsa_switch_rcv() packet type handler since
        netdev_uses_dsa() for the master will not yet return true, since
        dev->dsa_ptr has not yet been populated
      
      - packets cannot enter the dsa_slave_xmit() function because no DSA
        interface has yet been registered
      
      So from the DSA core's perspective, holding the rtnl_mutex is indeed not
      necessary.
      
      Yet, drivers may need to do things which need rtnl_mutex protection. For
      example:
      
      felix_set_tag_protocol
      -> felix_setup_tag_8021q
         -> dsa_tag_8021q_register
            -> dsa_tag_8021q_setup
               -> dsa_tag_8021q_port_setup
                  -> vlan_vid_add
                     -> ASSERT_RTNL
      
      These drivers do not really have a choice to take the rtnl_mutex
      themselves, since in the sysfs case, the rtnl_mutex is already held.
      
      Fixes: deff7107 ("net: dsa: Allow default tag protocol to be overridden from DT")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1951b3f1
    • V
      net: dsa: fix bridge_num not getting cleared after ports leaving the bridge · 1bec0f05
      Vladimir Oltean 提交于
      The dp->bridge_num is zero-based, with -1 being the encoding for an
      invalid value. But dsa_bridge_num_put used to check for an invalid value
      by comparing bridge_num with 0, which is of course incorrect.
      
      The result is that the bridge_num will never get cleared by
      dsa_bridge_num_put, and further port joins to other bridges will get a
      bridge_num larger than the previous one, and once all the available
      bridges with TX forwarding offload supported by the hardware get
      exhausted, the TX forwarding offload feature is simply disabled.
      
      In the case of sja1105, 7 iterations of the loop below are enough to
      exhaust the TX forwarding offload bits, and further bridge joins operate
      without that feature.
      
      ip link add br0 type bridge vlan_filtering 1
      
      while :; do
              ip link set sw0p2 master br0 && sleep 1
              ip link set sw0p2 nomaster && sleep 1
      done
      
      This issue is enough of an indication that having the dp->bridge_num
      invalid encoding be a negative number is prone to bugs, so this will be
      changed to a one-based value, with the dp->bridge_num of zero being the
      indication of no bridge. However, that is material for net-next.
      
      Fixes: f5e165e7 ("net: dsa: track unique bridge numbers across all DSA switch trees")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      1bec0f05
  16. 27 9月, 2021 1 次提交
  17. 22 9月, 2021 1 次提交
  18. 21 9月, 2021 2 次提交
    • V
      net: dsa: don't allocate the slave_mii_bus using devres · 5135e96a
      Vladimir Oltean 提交于
      The Linux device model permits both the ->shutdown and ->remove driver
      methods to get called during a shutdown procedure. Example: a DSA switch
      which sits on an SPI bus, and the SPI bus driver calls this on its
      ->shutdown method:
      
      spi_unregister_controller
      -> device_for_each_child(&ctlr->dev, NULL, __unregister);
         -> spi_unregister_device(to_spi_device(dev));
            -> device_del(&spi->dev);
      
      So this is a simple pattern which can theoretically appear on any bus,
      although the only other buses on which I've been able to find it are
      I2C:
      
      i2c_del_adapter
      -> device_for_each_child(&adap->dev, NULL, __unregister_client);
         -> i2c_unregister_device(client);
            -> device_unregister(&client->dev);
      
      The implication of this pattern is that devices on these buses can be
      unregistered after having been shut down. The drivers for these devices
      might choose to return early either from ->remove or ->shutdown if the
      other callback has already run once, and they might choose that the
      ->shutdown method should only perform a subset of the teardown done by
      ->remove (to avoid unnecessary delays when rebooting).
      
      So in other words, the device driver may choose on ->remove to not
      do anything (therefore to not unregister an MDIO bus it has registered
      on ->probe), because this ->remove is actually triggered by the
      device_shutdown path, and its ->shutdown method has already run and done
      the minimally required cleanup.
      
      This used to be fine until the blamed commit, but now, the following
      BUG_ON triggers:
      
      void mdiobus_free(struct mii_bus *bus)
      {
      	/* For compatibility with error handling in drivers. */
      	if (bus->state == MDIOBUS_ALLOCATED) {
      		kfree(bus);
      		return;
      	}
      
      	BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
      	bus->state = MDIOBUS_RELEASED;
      
      	put_device(&bus->dev);
      }
      
      In other words, there is an attempt to free an MDIO bus which was not
      unregistered. The attempt to free it comes from the devres release
      callbacks of the SPI device, which are executed after the device is
      unregistered.
      
      I'm not saying that the fact that MDIO buses allocated using devres
      would automatically get unregistered wasn't strange. I'm just saying
      that the commit didn't care about auditing existing call paths in the
      kernel, and now, the following code sequences are potentially buggy:
      
      (a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
          located on a bus that unregisters its children on shutdown. After
          the blamed patch, either both the alloc and the register should use
          devres, or none should.
      
      (b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
          mdiobus_unregister at all in the remove path. After the blamed
          patch, nobody unregisters the MDIO bus anymore, so this is even more
          buggy than the previous case which needs a specific bus
          configuration to be seen, this one is an unconditional bug.
      
      In this case, DSA falls into category (a), it tries to be helpful and
      registers an MDIO bus on behalf of the switch, which might be on such a
      bus. I've no idea why it does it under devres.
      
      It does this on probe:
      
      	if (!ds->slave_mii_bus && ds->ops->phy_read)
      		alloc and register mdio bus
      
      and this on remove:
      
      	if (ds->slave_mii_bus && ds->ops->phy_read)
      		unregister mdio bus
      
      I _could_ imagine using devres because the condition used on remove is
      different than the condition used on probe. So strictly speaking, DSA
      cannot determine whether the ds->slave_mii_bus it sees on remove is the
      ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
      have solved that problem. But nonetheless, the existing code already
      proceeds to unregister the MDIO bus, even though it might be
      unregistering an MDIO bus it has never registered. So I can only guess
      that no driver that implements ds->ops->phy_read also allocates and
      registers ds->slave_mii_bus itself.
      
      So in that case, if unregistering is fine, freeing must be fine too.
      
      Stop using devres and free the MDIO bus manually. This will make devres
      stop attempting to free a still registered MDIO bus on ->shutdown.
      
      Fixes: ac3a68d5 ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Reported-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Tested-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5135e96a
    • V
      net: dsa: fix dsa_tree_setup error path · e5845aa0
      Vladimir Oltean 提交于
      Since the blamed commit, dsa_tree_teardown_switches() was split into two
      smaller functions, dsa_tree_teardown_switches and dsa_tree_teardown_ports.
      
      However, the error path of dsa_tree_setup stopped calling dsa_tree_teardown_ports.
      
      Fixes: a57d8c21 ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5845aa0
  19. 19 9月, 2021 2 次提交
    • V
      net: dsa: tear down devlink port regions when tearing down the devlink port on error · fd292c18
      Vladimir Oltean 提交于
      Commit 86f8b1c0 ("net: dsa: Do not make user port errors fatal")
      decided it was fine to ignore errors on certain ports that fail to
      probe, and go on with the ports that do probe fine.
      
      Commit fb6ec87f ("net: dsa: Fix type was not set for devlink port")
      noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
      called, and devlink notices after a timeout of 3600 seconds and prints a
      WARN_ON. So it went ahead to unregister the devlink port. And because
      there exists an UNUSED port flavour, we actually re-register the devlink
      port as UNUSED.
      
      Commit 08156ba4 ("net: dsa: Add devlink port regions support to
      DSA") added devlink port regions, which are set up by the driver and not
      by DSA.
      
      When we trigger the devlink port deregistration and reregistration as
      unused, devlink now prints another WARN_ON, from here:
      
      devlink_port_unregister:
      	WARN_ON(!list_empty(&devlink_port->region_list));
      
      So the port still has regions, which makes sense, because they were set
      up by the driver, and the driver doesn't know we're unregistering the
      devlink port.
      
      Somebody needs to tear them down, and optionally (actually it would be
      nice, to be consistent) set them up again for the new devlink port.
      
      But DSA's layering stays in our way quite badly here.
      
      The options I've considered are:
      
      1. Introduce a function in devlink to just change a port's type and
         flavour. No dice, devlink keeps a lot of state, it really wants the
         port to not be registered when you set its parameters, so changing
         anything can only be done by destroying what we currently have and
         recreating it.
      
      2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
         and the region returned, keep those in a list, then when the devlink
         port unregister needs to take place, the existing devlink regions are
         destroyed by DSA, and we replay the creation of new regions using the
         cached parameters. Problem: mv88e6xxx keeps the region pointers in
         chip->ports[port].region, and these will remain stale after DSA frees
         them. There are many things DSA can do, but updating mv88e6xxx's
         private pointers is not one of them.
      
      3. Just let the driver do it (i.e. introduce a very specific method
         called ds->ops->port_reinit_as_unused, which unregisters its devlink
         port devlink regions, then the old devlink port, then registers the
         new one, then the devlink port regions for it). While it does work,
         as opposed to the others, it's pretty horrible from an API
         perspective and we can do better.
      
      4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
         which in the case of mv88e6xxx must register and unregister the
         devlink port regions. Call these 2 methods when the port must be
         reinitialized as unused.
      
      Naturally, I went for the 4th approach.
      
      Fixes: 08156ba4 ("net: dsa: Add devlink port regions support to DSA")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd292c18
    • V
      net: dsa: be compatible with masters which unregister on shutdown · 0650bf52
      Vladimir Oltean 提交于
      Lino reports that on his system with bcmgenet as DSA master and KSZ9897
      as a switch, rebooting or shutting down never works properly.
      
      What does the bcmgenet driver have special to trigger this, that other
      DSA masters do not? It has an implementation of ->shutdown which simply
      calls its ->remove implementation. Otherwise said, it unregisters its
      network interface on shutdown.
      
      This message can be seen in a loop, and it hangs the reboot process there:
      
      unregister_netdevice: waiting for eth0 to become free. Usage count = 3
      
      So why 3?
      
      A usage count of 1 is normal for a registered network interface, and any
      virtual interface which links itself as an upper of that will increment
      it via dev_hold. In the case of DSA, this is the call path:
      
      dsa_slave_create
      -> netdev_upper_dev_link
         -> __netdev_upper_dev_link
            -> __netdev_adjacent_dev_insert
               -> dev_hold
      
      So a DSA switch with 3 interfaces will result in a usage count elevated
      by two, and netdev_wait_allrefs will wait until they have gone away.
      
      Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
      delete themselves, but DSA cannot just vanish and go poof, at most it
      can unbind itself from the switch devices, but that must happen strictly
      earlier compared to when the DSA master unregisters its net_device, so
      reacting on the NETDEV_UNREGISTER event is way too late.
      
      It seems that it is a pretty established pattern to have a driver's
      ->shutdown hook redirect to its ->remove hook, so the same code is
      executed regardless of whether the driver is unbound from the device, or
      the system is just shutting down. As Florian puts it, it is quite a big
      hammer for bcmgenet to unregister its net_device during shutdown, but
      having a common code path with the driver unbind helps ensure it is well
      tested.
      
      So DSA, for better or for worse, has to live with that and engage in an
      arms race of implementing the ->shutdown hook too, from all individual
      drivers, and do something sane when paired with masters that unregister
      their net_device there. The only sane thing to do, of course, is to
      unlink from the master.
      
      However, complications arise really quickly.
      
      The pattern of redirecting ->shutdown to ->remove is not unique to
      bcmgenet or even to net_device drivers. In fact, SPI controllers do it
      too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
      and MDIO controllers do it too (this is something I have not researched
      too deeply, but even if this is not the case today, it is certainly
      plausible to happen in the future, and must be taken into consideration).
      
      Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
      insane implication is that for the exact same DSA switch device, we
      might have both ->shutdown and ->remove getting called.
      
      So we need to do something with that insane environment. The pattern
      I've come up with is "if this, then not that", so if either ->shutdown
      or ->remove gets called, we set the device's drvdata to NULL, and in the
      other hook, we check whether the drvdata is NULL and just do nothing.
      This is probably not necessary for platform devices, just for devices on
      buses, but I would really insist for consistency among drivers, because
      when code is copy-pasted, it is not always copy-pasted from the best
      sources.
      
      So depending on whether the DSA switch's ->remove or ->shutdown will get
      called first, we cannot really guarantee even for the same driver if
      rebooting will result in the same code path on all platforms. But
      nonetheless, we need to do something minimally reasonable on ->shutdown
      too to fix the bug. Of course, the ->remove will do more (a full
      teardown of the tree, with all data structures freed, and this is why
      the bug was not caught for so long). The new ->shutdown method is kept
      separate from dsa_unregister_switch not because we couldn't have
      unregistered the switch, but simply in the interest of doing something
      quick and to the point.
      
      The big question is: does the DSA switch's ->shutdown get called earlier
      than the DSA master's ->shutdown? If not, there is still a risk that we
      might still trigger the WARN_ON in unregister_netdevice that says we are
      attempting to unregister a net_device which has uppers. That's no good.
      Although the reference to the master net_device won't physically go away
      even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
      on it.
      
      The answer to that question lies in this comment above device_link_add:
      
       * A side effect of the link creation is re-ordering of dpm_list and the
       * devices_kset list by moving the consumer device and all devices depending
       * on it to the ends of these lists (that does not happen to devices that have
       * not been registered when this function is called).
      
      so the fact that DSA uses device_link_add towards its master is not
      exactly for nothing. device_shutdown() walks devices_kset from the back,
      so this is our guarantee that DSA's shutdown happens before the master's
      shutdown.
      
      Fixes: 2f1e8ea7 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
      Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/Reported-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0650bf52
  20. 16 9月, 2021 1 次提交
    • V
      net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports · a57d8c21
      Vladimir Oltean 提交于
      Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
      messages appear:
      
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
      
      (and similarly for other ports)
      
      What happens is that DSA has a policy "even if there are bugs, let's at
      least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
      dp->mdbs lists, which are supposed to be empty.
      
      But deleting that cleanup code, the warnings go away.
      
      => the FDB and MDB lists (used for refcounting on shared ports, aka CPU
      and DSA ports) will eventually be empty, but are not empty by the time
      we tear down those ports. Aka we are deleting them too soon.
      
      The addresses that DSA complains about are host-trapped addresses: the
      local addresses of the ports, and the MAC address of the bridge device.
      
      The problem is that offloading those entries happens from a deferred
      work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
      races with the teardown of the CPU and DSA ports where the refcounting
      is kept.
      
      In fact, not only it races, but fundamentally speaking, if we iterate
      through the port list linearly, we might end up tearing down the shared
      ports even before we delete a DSA user port which has a bridge upper.
      
      So as it turns out, we need to first tear down the user ports (and the
      unused ones, for no better place of doing that), then the shared ports
      (the CPU and DSA ports). In between, we need to ensure that all work
      items scheduled by our switchdev handlers (which only run for user
      ports, hence the reason why we tear them down first) have finished.
      
      Fixes: 161ca59d ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      a57d8c21
  21. 23 8月, 2021 1 次提交
    • V
      net: dsa: track unique bridge numbers across all DSA switch trees · f5e165e7
      Vladimir Oltean 提交于
      Right now, cross-tree bridging setups work somewhat by mistake.
      
      In the case of cross-tree bridging with sja1105, all switch instances
      need to agree upon a common VLAN ID for forwarding a packet that belongs
      to a certain bridging domain.
      
      With TX forwarding offload, the VLAN ID is the bridge VLAN for
      VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
      (a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
      
      The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
      value calculated by DSA independently for each switch tree.
      
      If ports from one tree join one bridge, and ports from another tree join
      another bridge, DSA will assign them the same bridge_num, even though
      the bridges are different. If cross-tree bridging is supported, this
      is an issue.
      
      Modify DSA to calculate the bridge_num globally across all switch trees.
      This has the implication for a driver that the dp->bridge_num value that
      DSA will assign to its ports might not be contiguous, if there are
      boards with multiple DSA drivers instantiated. Additionally, all
      bridge_num values eat up towards each switch's
      ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
      and can be seen as a limitation introduced by this patch. However, that
      is the lesser evil for now.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5e165e7
  22. 12 8月, 2021 1 次提交
    • V
      net: dsa: tag_8021q: don't broadcast during setup/teardown · 724395f4
      Vladimir Oltean 提交于
      Currently, on my board with multiple sja1105 switches in disjoint trees
      described in commit f66a6a69 ("net: dsa: permit cross-chip bridging
      between all trees in the system"), rebooting the board triggers the
      following benign warnings:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      
      Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
      RX VLANs cannot be found on switch 2's CPU port.
      
      But why would switch 2 even attempt to delete switch 1's TX and RX
      tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
      and it is supposed that it had added those VLANs in the first place
      (because in dsa_port_tag_8021q_vlan_match, all CPU ports match
      regardless of their tree index or switch index).
      
      The two trees probe asynchronously, and when switch 1 probed, it called
      dsa_broadcast which did not notify the tree of switch 2, because that
      didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
      _is_ notified of the deletion.
      
      Before jumping to introduce a synchronization mechanism between the
      probing across disjoint switch trees, let's take a step back and see
      whether we _need_ to do that in the first place.
      
      The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
      only if switch 1 and 2 were part of a cross-chip bridge. And
      dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
      was synchronous, the bridge_join would just end up bumping the VLANs'
      refcount, because they are already installed by the setup path).
      
      Since by the time the ports are bridged, all DSA trees are already set
      up, and we don't need the tag_8021q VLANs of one switch installed on the
      other switches during probe time, the answer is that we don't need to
      fix the synchronization issue.
      
      So make the setup and teardown code paths call dsa_port_notify, which
      notifies only the local tree, and the bridge code paths call
      dsa_broadcast, which let the other trees know as well.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      724395f4
  23. 09 8月, 2021 1 次提交
    • L
      devlink: Set device as early as possible · 919d13a7
      Leon Romanovsky 提交于
      All kernel devlink implementations call to devlink_alloc() during
      initialization routine for specific device which is used later as
      a parent device for devlink_register().
      
      Such late device assignment causes to the situation which requires us to
      call to device_register() before setting other parameters, but that call
      opens devlink to the world and makes accessible for the netlink users.
      
      Any attempt to move devlink_register() to be the last call generates the
      following error due to access to the devlink->dev pointer.
      
      [    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
      [    8.760305]  devlink_param_notify+0x6d/0x180
      [    8.760435]  __devlink_params_register+0x2f1/0x670
      [    8.760558]  devlink_params_register+0x1e/0x20
      
      The simple change of API to set devlink device in the devlink_alloc()
      instead of devlink_register() fixes all this above and ensures that
      prior to call to devlink_register() everything already set.
      Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      919d13a7
  24. 05 8月, 2021 2 次提交
    • V
      net: dsa: give preference to local CPU ports · 2c0b0325
      Vladimir Oltean 提交于
      Be there an "H" switch topology, where there are 2 switches connected as
      follows:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      basically one where each switch has its own CPU port for termination,
      but there is also a DSA link in case packets need to be forwarded in
      hardware between one switch and another.
      
      DSA insists to see this as a daisy chain topology, basically registering
      all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
      eth1 as a valid DSA master.
      
      This is only half the story, since when asked using dsa_port_is_cpu(),
      DSA will respond that sw1p1 is a CPU port, however one which has no
      dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
      
      Furthermore, be there a driver for switches which support only one
      upstream port. This driver iterates through its ports and checks using
      dsa_is_upstream_port() whether the current port is an upstream one.
      For switch 1, two ports pass the "is upstream port" checks:
      
      - sw1p4 is an upstream port because it is a routing port towards the
        dedicated CPU port assigned using dsa_tree_setup_default_cpu()
      
      - sw1p1 is also an upstream port because it is a CPU port, albeit one
        that is disabled. This is because dsa_upstream_port() returns:
      
      	if (!cpu_dp)
      		return port;
      
        which means that if @dp does not have a ->cpu_dp pointer (which is a
        characteristic of CPU ports themselves as well as unused ports), then
        @dp is its own upstream port.
      
      So the driver for switch 1 rightfully says: I have two upstream ports,
      but I don't support multiple upstream ports! So let me error out, I
      don't know which one to choose and what to do with the other one.
      
      Generally I am against enforcing any default policy in the kernel in
      terms of user to CPU port assignment (like round robin or such) but this
      case is different. To solve the conundrum, one would have to:
      
      - Disable sw1p1 in the device tree or mark it as "not a CPU port" in
        order to comply with DSA's view of this topology as a daisy chain,
        where the termination traffic from switch 1 must pass through switch 0.
        This is counter-productive because it wastes 1Gbps of termination
        throughput in switch 1.
      - Disable the DSA link between sw0p4 and sw1p4 and do software
        forwarding between switch 0 and 1, and basically treat the switches as
        part of disjoint switch trees. This is counter-productive because it
        wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
      - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
        work, but it makes cross-chip bridging impossible. In this setup we
        would need to have 2 separate bridges, br0 spanning the ports of
        switch 0, and br1 spanning the ports of switch 1, and the "DSA links
        treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
        the gateway ports between one bridge and another. This is hard to
        manage from a user's perspective, who wants to have a unified view of
        the switching fabric and the ability to transparently add ports to the
        same bridge. VLANs would also need to be explicitly managed by the
        user on these gateway ports.
      
      So it seems that the only reasonable thing to do is to make DSA prefer
      CPU ports that are local to the switch. Meaning that by default, the
      user and DSA ports of switch 0 will get assigned to the CPU port from
      switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
      assigned to the CPU port from switch 1.
      
      The way this solves the problem is that sw1p4 is no longer an upstream
      port as far as switch 1 is concerned (it no longer views sw0p1 as its
      dedicated CPU port).
      
      So here we are, the first multi-CPU port that DSA supports is also
      perhaps the most uneventful one: the individual switches don't support
      multiple CPUs, however the DSA switch tree as a whole does have multiple
      CPU ports. No user space assignment of user ports to CPU ports is
      desirable, necessary, or possible.
      
      Ports that do not have a local CPU port (say there was an extra switch
      hanging off of sw0p0) default to the standard implementation of getting
      assigned to the first CPU port of the DSA switch tree. Is that good
      enough? Probably not (if the downstream switch was hanging off of switch
      1, we would most certainly prefer its CPU port to be sw1p1), but in
      order to support that use case too, we would need to traverse the
      dst->rtable in search of an optimum dedicated CPU port, one that has the
      smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
      moment, the DSA routing table structure does not keep the number of hops
      between dl->dp and dl->link_dp, and while it is probably deducible,
      there is zero justification to write that code now. Let's hope DSA will
      never have to support that use case.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c0b0325
    • V
      net: dsa: rename teardown_default_cpu to teardown_cpu_ports · 0e8eb9a1
      Vladimir Oltean 提交于
      There is nothing specific to having a default CPU port to what
      dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports,
      it would do the same thing: iterate through the ports of this switch
      tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e8eb9a1
  25. 23 7月, 2021 2 次提交
    • V
      net: dsa: add support for bridge TX forwarding offload · 123abc06
      Vladimir Oltean 提交于
      For a DSA switch, to offload the forwarding process of a bridge device
      means to send the packets coming from the software bridge as data plane
      packets. This is contrary to everything that DSA has done so far,
      because the current taggers only know to send control packets (ones that
      target a specific destination port), whereas data plane packets are
      supposed to be forwarded according to the FDB lookup, much like packets
      ingressing on any regular ingress port. If the FDB lookup process
      returns multiple destination ports (flooding, multicast), then
      replication is also handled by the switch hardware - the bridge only
      sends a single packet and avoids the skb_clone().
      
      DSA keeps for each bridge port a zero-based index (the number of the
      bridge). Multiple ports performing TX forwarding offload to the same
      bridge have the same dp->bridge_num value, and ports not offloading the
      TX data plane of a bridge have dp->bridge_num = -1.
      
      The tagger can check if the packet that is being transmitted on has
      skb->offload_fwd_mark = true or not. If it does, it can be sure that the
      packet belongs to the data plane of a bridge, further information about
      which can be obtained based on dp->bridge_dev and dp->bridge_num.
      It can then compose a DSA tag for injecting a data plane packet into
      that bridge number.
      
      For the switch driver side, we offer two new dsa_switch_ops methods,
      called .port_bridge_fwd_offload_{add,del}, which are modeled after
      .port_bridge_{join,leave}.
      These methods are provided in case the driver needs to configure the
      hardware to treat packets coming from that bridge software interface as
      data plane packets. The switchdev <-> bridge interaction happens during
      the netdev_master_upper_dev_link() call, so to switch drivers, the
      effect is that the .port_bridge_fwd_offload_add() method is called
      immediately after .port_bridge_join().
      
      If the bridge number exceeds the number of bridges for which the switch
      driver can offload the TX data plane (and this includes the case where
      the driver can offload none), DSA falls back to simply returning
      tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      123abc06
    • V
      net: dsa: track the number of switches in a tree · 5b22d366
      Vladimir Oltean 提交于
      In preparation of supporting data plane forwarding on behalf of a
      software bridge, some drivers might need to view bridges as virtual
      switches behind the CPU port in a cross-chip topology.
      
      Give them some help and let them know how many physical switches there
      are in the tree, so that they can count the virtual switches starting
      from that number on.
      
      Note that the first dsa_switch_ops method where this information is
      reliably available is .setup(). This is because of how DSA works:
      in a tree with 3 switches, each calling dsa_register_switch(), the first
      2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
      and exit with error code 0 because the topology is not complete. Since
      probing is parallel at this point, one switch does not know about the
      existence of the other. Then the third switch comes, and for it,
      dsa_tree_setup_routing_table() returns complete = true. This switch goes
      ahead and calls dsa_tree_setup_switches() for everybody else, calling
      their .setup() methods too. This acts as the synchronization point.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b22d366
  26. 30 6月, 2021 1 次提交