1. 22 Jul, 2021 1 commit
    • net: bridge: move the switchdev object replay helpers to "push" mode · 4e51bf44
      Vladimir Oltean committed
      Starting with commit 4f2673b3 ("net: bridge: add helper to replay
      port and host-joined mdb entries"), DSA has introduced some bridge
      helpers that replay switchdev events (FDB/MDB/VLAN additions and
      deletions) that can be lost by the switchdev drivers in a variety of
      circumstances:
      
      - an IP multicast group was host-joined on the bridge itself before any
        switchdev port joined the bridge, leading to the host MDB entries
        missing in the hardware database.
      - during the bridge creation process, the MAC address of the bridge was
        added to the FDB as an entry pointing towards the bridge device
        itself, but with no switchdev ports being part of the bridge yet, this
        local FDB entry would remain unknown to the switchdev hardware
        database.
      - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
        before any switchdev port joined that LAG, leading to the hardware
        database missing those entries.
      - a switchdev port left a LAG that is a bridge port, while the LAG
        remained part of the bridge, and all FDB/MDB/VLAN entries remained
        installed in the hardware database of the switchdev port.
      
      Also, since commit 0d2cfbd4 ("net: bridge: ignore switchdev events
      for LAG ports which didn't request replay"), DSA introduced a method,
      based on a const void *ctx, to ensure that two switchdev ports under the
      same LAG that is a bridge port do not see the same MDB/VLAN entry being
      replayed twice by the bridge, once for every bridge port that joins the
      LAG.
      
      With so many ordering corner cases being possible, it seems unreasonable
      to expect a switchdev driver writer to get it right on the first try.
      Therefore, now that DSA has experimented with the bridge replay helpers
      for a little bit, we can move the code to the bridge driver where it is
      more readily available to all switchdev drivers.
      
      To convert the switchdev object replay helpers from "pull mode" (where
      the driver asks for them) to a "push mode" (where the bridge offers them
      automatically), the biggest problem is that the bridge needs to be aware
      when a switchdev port joins and leaves, even when the switchdev is only
      indirectly a bridge port (for example when the bridge port is a LAG
      upper of the switchdev).
      
      Luckily, we already have a hook for that, in the form of the newly
      introduced switchdev_bridge_port_offload() and
      switchdev_bridge_port_unoffload() calls. These offer a natural place for
      hooking the object addition and deletion replays.
      
      Extend the above 2 functions with:
      - pointers to the switchdev atomic notifier (for FDB replays) and the
        blocking notifier (for MDB and VLAN replays).
      - the "const void *ctx" argument required for drivers to be able to
        disambiguate between which port is targeted, when multiple ports are
        lowers of the same LAG that is a bridge port. Most of the drivers pass
        NULL to this argument, except the ones that support LAG offload and have
        the proper context check already in place in the switchdev blocking
        notifier handler.
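
A minimal sketch of the "const void *ctx" check described above, written as a standalone toy model rather than kernel code. struct toy_port and the toy_* functions are hypothetical names, and the rule "a NULL ctx is handled by everyone, a non-NULL ctx only by the port it points to" is the assumption being illustrated:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy port: counts how many replayed objects it installed. */
struct toy_port {
	const char *name;
	int installed;
};

/*
 * Toy stand-in for a driver's switchdev blocking notifier handler.
 * A NULL ctx means the event is not targeted, so the port handles it;
 * otherwise only the port that ctx points to handles it.
 * Returns 1 if this port handled the event, 0 if it ignored it.
 */
static int toy_port_obj_add(struct toy_port *port, const void *ctx)
{
	if (ctx && ctx != port)
		return 0;

	port->installed++;
	return 1;
}

/* Deliver one replayed object to every lower of the same LAG. */
static int toy_replay_to_lag(struct toy_port **lowers, size_t n,
			     const void *ctx)
{
	int handled = 0;
	size_t i;

	for (i = 0; i < n; i++)
		handled += toy_port_obj_add(lowers[i], ctx);

	return handled;
}
```

With two lowers under one LAG, a replay whose ctx points at the second port is installed once, by that port only, while the first port keeps the single copy it received when it joined.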
      
      Also unexport the replay helpers, since nobody except the bridge calls
      them directly now.
      
      Note that:
      (a) we abuse the terminology slightly, because FDB entries are not
          "switchdev objects", but we count them as objects nonetheless.
          With no direct way to prove it, I think they are not modeled as
          switchdev objects because those can only be installed by the bridge
          to the hardware (as opposed to FDB entries which can be propagated
          in the other direction too). This is merely an abuse of terms, FDB
          entries are replayed too, despite not being objects.
      (b) the bridge does not attempt to sync port attributes to newly joined
          ports, just the countable stuff (the objects). The reason for this
          is simple: no universal and symmetric way to sync and unsync them is
          known. For example, VLAN filtering: what to do on unsync, disable or
          leave it enabled? Similarly, STP state, ageing timer, etc. What
          a switchdev port does when it becomes standalone again is not really
          within the bridge's competence, and the driver should deal with it.
          On the other hand, replaying deletions of switchdev objects can be
          seen as a matter of cleanup and therefore be handled by the bridge,
          hence this patch.
      
      We make the replay helpers opt-in for drivers, because they might not
      bring immediate benefits for them:
      
      - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
        so br_vlan_replay() should not do anything for the new drivers on
        which we call it. The existing drivers where there was even a slight
        possibility for there to exist a VLAN on a bridge port before they
        join it are already guarded against this: mlxsw and prestera deny
        joining LAG interfaces that are members of a bridge.
      
      - br_fdb_replay() should now notify of local FDB entries, but I patched
        all drivers except DSA to ignore these new entries in commit
        2c4eca3e ("net: bridge: switchdev: include local flag in FDB
        notifications"). Driver authors can lift this restriction as they
        wish, and when they do, they can also opt into the FDB replay
        functionality.
      
      - br_mdb_replay() should fix a real issue which is described in commit
        4f2673b3 ("net: bridge: add helper to replay port and host-joined
        mdb entries"). However most drivers do not offload the
        SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
        offload this switchdev object, and I don't completely understand the
        way in which they offload this switchdev object anyway. So I'll leave
        it up to these drivers' respective maintainers to opt into
        br_mdb_replay().
      
      So most of the drivers pass NULL notifier blocks for the replay helpers,
      except:
      - dpaa2-switch which was already acked/regression-tested with the
        helpers enabled (and there isn't much of a downside in having them)
      - ocelot which already had replay logic in "pull" mode
      - DSA which already had replay logic in "pull" mode
      
      An important observation is that the drivers which don't currently
      request bridge event replays don't even have the
      switchdev_bridge_port_{offload,unoffload} calls placed in proper places
      right now. This was done to avoid unnecessary rework for drivers which
      might never even add support for this. For driver writers who wish to
      add replay support, this can be used as a tentative placement guide:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 20 Jul, 2021 3 commits
    • net: dsa: use switchdev_handle_fdb_{add,del}_to_device · b94dc99c
      Vladimir Oltean committed
      Using the new fan-out helper for FDB entries installed on the software
      bridge, we can install host addresses with the proper refcount on the
      CPU port, such that this case:
      
      ip link set swp0 master br0
      ip link set swp1 master br0
      ip link set swp2 master br0
      ip link set swp3 master br0
      ip link set br0 address 00:01:02:03:04:05
      ip link set swp3 nomaster
      
      works properly and the br0 address remains installed as a host entry
      with refcount 3 instead of getting deleted.
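
The reference counting behind this scenario can be sketched as a standalone toy model, not the DSA implementation. struct toy_host_fdb and the toy_* functions are hypothetical names; the assumption is that each bridged user port takes one reference on the host entry and drops it when it leaves:

```c
#include <assert.h>

/* Hypothetical toy host entry on the CPU port: one reference per user port. */
struct toy_host_fdb {
	int refcount;
};

/* A bridged user port asks for the address: take one reference. */
static void toy_host_fdb_add(struct toy_host_fdb *e)
{
	e->refcount++;
}

/*
 * A user port no longer needs the address: drop one reference.
 * Returns 1 when the last reference is gone and the entry must be
 * removed from hardware, 0 while other ports still need it.
 */
static int toy_host_fdb_del(struct toy_host_fdb *e)
{
	assert(e->refcount > 0);
	return --e->refcount == 0;
}
```

Four calls to toy_host_fdb_add() followed by one toy_host_fdb_del() leave the count at 3 and report that the entry must stay, which mirrors swp3 leaving br0 in the commands above.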
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: add proper cross-chip notifier support · c64b9c05
      Vladimir Oltean committed
      The big problem which mandates cross-chip notifiers for tag_8021q is
      this:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      
      When the user runs:
      
      ip link add br0 type bridge
      ip link set sw0p0 master br0
      ip link set sw2p0 master br0
      
      It doesn't work.
      
      This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
      "other_ds" are at most 1 hop away from each other, so it is sufficient
      to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
      versa and presto, the cross-chip link works. When there is another
      switch in the middle, such as in this case switch 1 with its DSA links
      sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
      
      Which is exactly why the problem is quadratic: when a port joins a
      bridge, for each port in the tree that's already in that same bridge we
      notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
      tree. It is a very complicated web of VLANs.
      
      It must be mentioned that currently we install tag_8021q VLANs on too
      many ports (DSA links - to be precise, on all of them). For example,
      when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
      RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
      isn't any port of switch 0 that is a member of br0 (at least yet).
      In theory we could notify only the switches which sit in between the
      port joining the bridge and the port reacting to that bridge_join event.
      But in practice that is impossible, because of the way 'link' properties
      are described in the device tree. The DSA bindings require DT writers to
      list out not only the real/physical DSA links, but in fact the entire
      routing table, like for example switch 0 above will have:
      
      	sw0p3: port@3 {
      		link = <&sw1p4 &sw2p4>;
      	};
      
      This was done because:
      
      /* TODO: ideally DSA ports would have a single dp->link_dp member,
       * and no dst->rtable nor this struct dsa_link would be needed,
       * but this would require some more complex tree walking,
       * so keep it stupid at the moment and list them all.
       */
      
      but it is a perfect example of a situation where too much information is
      actively detrimental, because we are now in the position where we
      cannot distinguish a real DSA link from one that is put there to avoid
      the 'complex tree walking'. And because DT is ABI, there is not much we
      can change.
      
      And because we do not know which DSA links are real and which ones
      aren't, we can't really know if DSA switch A is in the data path between
      switches B and C, in the general case.
      
      So this is why tag_8021q RX VLANs are added on all DSA links, and
      probably why it will never change.
      
      On the other hand, at least the number of additions/deletions is well
      balanced, and this means that once we implement reference counting at
      the cross-chip notifier level a la fdb/mdb, there is absolutely zero
      need for a struct dsa_8021q_crosschip_link, it's all self-managing.
      
      In fact, with the tag_8021q notifiers emitted from the bridge join
      notifiers, it becomes so generic that sja1105 does not need to do
      anything anymore, we can just delete its implementation of the
      .crosschip_bridge_{join,leave} methods.
      
      Among other things we can simply delete is the home-grown implementation
      of sja1105_notify_crosschip_switches(). The reason why that is wrong is
      because it is not quadratic - it only covers remote switches to which we
      have a cross-chip bridging link and that does not cover in-between
      switches. This deletion is part of the same patch because sja1105 used
      to poke deep inside the guts of the tag_8021q context in order to do
      that. Because the cross-chip links went away, so must the sja1105 code.
      
      Last but not least, dsa_8021q_setup_port() is simplified (and also
      renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
      on the CPU port too, the four dsa_8021q_vid_apply() calls:
      - 1 for RX VLAN on user port
      - 1 for the user port's RX VLAN on the CPU port
      - 1 for TX VLAN on user port
      - 1 for the user port's TX VLAN on the CPU port
      
      now get squashed into only 2 notifier calls via
      dsa_port_tag_8021q_vlan_add.
      
      And because the notifiers to add and to delete a tag_8021q VLAN are
      distinct, now we finally break up the port setup and teardown into
      separate functions instead of relying on a "bool enabled" flag which
      tells us what to do. Arguably it should have been this way from the
      get go.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time · e19cc13c
      Vladimir Oltean committed
      There has been at least one wasted opportunity for tag_8021q to be used
      by a driver:
      
      https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272
      
      because of a design decision: the declared purpose of tag_8021q is to
      offer source port/switch identification for a tagging driver for packets
      coming from a switch with no hardware DSA tagging support. It is not
      intended to provide VLAN-based port isolation, because its first user,
      sja1105, had another mechanism for bridging domain isolation, the L2
      Forwarding Table. So even if 2 ports are in the same VLAN but they are
      separated via the L2 Forwarding Table, they will not communicate with
      one another. The L2 Forwarding Table is managed by the
      sja1105_bridge_join() and sja1105_bridge_leave() methods.
      
      As a consequence, today tag_8021q does not bother too much with hooking
      into .port_bridge_join() and .port_bridge_leave() because that would
      introduce yet another degree of freedom; it just iterates statically
      through all ports of a switch and adds the RX VLAN of one port to all
      the others. In this way, whenever .port_bridge_join() is called,
      bridging will magically work because the RX VLANs are already installed
      everywhere they need to be.
      
      This is not to say that the reason for the change in this patch is to
      satisfy the hellcreek and similar use cases, that is merely a nice side
      effect. Instead it is to make sja1105 cross-chip links work properly
      over a DSA link.
      
      For context, sja1105 today supports a degenerate form of cross-chip
      bridging, where the switches are interconnected through their CPU ports
      ("disjoint trees" topology). There is some code which has been
      generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
      enough, and frankly it is impossible to build upon that.
      Real multi-switch DSA trees, like daisy chains or H trees, which have
      actual DSA links, do not work.
      
      The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
      for cross-chip bridging, which is a table by which the local switch can
      select the forwarding domain for packets from a certain ingress switch
      ID and source port. The sja1105 switches cannot parse their own DSA
      tags, because, well, they don't really have support for DSA tags, it's
      all VLANs.
      
      So to make something like cross-chip bridging between sw0p0 and sw1p0
      work over the sw0p3/sw1p3 DSA link with sja1105 in the topology
      below:
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ] ---- [  dsa  ] [  cpu  ] [  user ] [  user ]
      
      we need to ask ourselves 2 questions:
      
      (1) how should the L2 Forwarding Table be managed?
      (2) how should the VLAN Lookup Table be managed?
      
      i.e. what should prevent packets from going to unwanted ports?
      
      Since, as mentioned, there is no PVT, the L2 Forwarding Table only
      contains forwarding rules for local ports. So we can say "all user ports
      are allowed to forward to all CPU ports and all DSA links".
      
      If we allow forwarding to DSA links unconditionally, this means we must
      prevent forwarding using the VLAN Lookup Table. This is in fact
      asymmetric with what we do for tag_8021q on ports local to the same
      switch, and it matters because now that we are making tag_8021q a core
      DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
      the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
      for local forwarding in the same way as cross-chip forwarding.
      
      Note that there is a very precise reason why tag_8021q hooks into
      dsa_switch_bridge_join() which acts at the cross-chip notifier level,
      and not at a higher level such as dsa_port_bridge_join(). We need to
      install the RX VLAN of the newly joining port into the VLAN table of all
      the existing ports across the tree that are part of the same bridge, and
      the notifier already does the iteration through the switches for us.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 30 Jun, 2021 3 commits
    • net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev · 4bed397c
      Vladimir Oltean committed
      When
      (a) "dev" is a bridge port which the DSA switch tree offloads, but is
          otherwise not a dsa slave (such as a LAG netdev), or
      (b) "dev" is the bridge net device itself
      
      then strange things happen to the dev_hold/dev_put pair:
      dsa_schedule_work() will still be called with a DSA port that offloads
      that netdev, but dev_hold() will be called on the non-DSA netdev.
      Then the "if" condition in dsa_slave_switchdev_event_work() does not
      pass, because "dev" is not a DSA netdev, so dev_put() is not called.
      
      This results in the simple fact that we have a reference counting
      mismatch on the "dev" net device.
      
      This can be seen when we add support for host addresses installed on the
      bridge net device.
      
      ip link add br1 type bridge
      ip link set br1 address 00:01:02:03:04:05
      ip link set swp0 master br1
      ip link del br1
      [  968.512278] unregister_netdevice: waiting for br1 to become free. Usage count = 5
      
      It seems foolish to do penny pinching and not add the net_device pointer
      in the dsa_switchdev_event_work structure, so let's finally do that.
      As an added bonus, when we start offloading local entries pointing
      towards the bridge, these will now properly appear as 'offloaded' in
      'bridge fdb' (this was not possible before, because 'dev' was assumed to
      only be a DSA net device):
      
      00:01:02:03:04:05 dev br0 vlan 1 offload master br0 permanent
      00:01:02:03:04:05 dev br0 offload master br0 permanent
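
The hold/put imbalance and its fix can be sketched as a standalone toy model, not the kernel code. struct toy_netdev, struct toy_work and the toy_* functions are hypothetical names; the only assumption is the one stated above, that the hold is unconditional while the put used to depend on "dev" being a DSA netdev:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy net device with a reference count. */
struct toy_netdev {
	bool is_dsa_slave;
	int refcnt;
};

/* Toy deferred work item; "dev" plays the role of the member the fix adds. */
struct toy_work {
	struct toy_netdev *dev;
};

static void toy_hold(struct toy_netdev *dev) { dev->refcnt++; }
static void toy_put(struct toy_netdev *dev)  { dev->refcnt--; }

/* Before the fix: the put is conditional, the hold is not. */
static void toy_event_unbalanced(struct toy_netdev *dev)
{
	toy_hold(dev);			/* taken when the work is scheduled */
	if (dev->is_dsa_slave)		/* false for a LAG or for the bridge */
		toy_put(dev);
}

/* After the fix: remember the device in the work item and always put it. */
static void toy_event_balanced(struct toy_netdev *dev)
{
	struct toy_work work = { .dev = dev };

	toy_hold(work.dev);
	toy_put(work.dev);
}
```

Running toy_event_unbalanced() on a device with is_dsa_slave == false leaves one reference behind per event, which is the kind of leak that the "waiting for br1 to become free" message reports.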
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: introduce a separate cross-chip notifier type for host FDBs · 3dc80afc
      Vladimir Oltean committed
      DSA treats some bridge FDB entries by trapping them to the CPU port.
      Currently, the only class of such entries is FDB addresses learnt by
      the software bridge on a foreign interface. However there are many more
      to be added:
      
      - FDB entries with the is_local flag (for termination) added by the
        bridge on the user ports (typically containing the MAC address of the
        bridge port)
      - FDB entries pointing towards the bridge net device (for termination).
        Typically these contain the MAC address of the bridge net device.
      - Static FDB entries installed on a foreign interface that is in the
        same bridge with a DSA user port.
      
      The reason why a separate cross-chip notifier for host FDBs is justified
      compared to normal FDBs is the same as in the case of host MDBs: the
      cross-chip notifier matching function in switch.c should avoid
      installing these entries on routing ports that route towards the
      targeted switch, but not towards the CPU. This is required in order to
      have proper support for H-like multi-chip topologies.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: introduce a separate cross-chip notifier type for host MDBs · b8e997c4
      Vladimir Oltean committed
      Commit abd49535 ("net: dsa: execute dsa_switch_mdb_add only for
      routing port in cross-chip topologies") does a surprisingly good job
      even for the SWITCHDEV_OBJ_ID_HOST_MDB use case, where DSA simply
      translates a switchdev object received on dp into a cross-chip notifier
      for dp->cpu_dp.
      
      To visualize how that works, imagine the daisy chain topology below and
      consider a SWITCHDEV_OBJ_ID_HOST_MDB object emitted on sw2p0. How does
      the cross-chip notifier know to match on all the right ports (sw0p4, the
      dedicated CPU port, sw1p4, an upstream DSA link, and sw2p4, another
      upstream DSA link)?
      
                                                      |
             sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
          [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
          [       ] [       ] [       ] [       ] [   x   ]
                                            |
                                            +---------+
                                                      |
             sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
          [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
          [       ] [       ] [       ] [       ] [   x   ]
                                            |
                                            +---------+
                                                      |
             sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
          [  user ] [  user ] [  user ] [  user ] [  dsa  ]
          [       ] [       ] [       ] [       ] [   x   ]
      
      The answer is simple: the dedicated CPU port of sw2p0 is sw0p4, and
      dsa_routing_port returns the upstream port for all switches.
      
      That is fine, but there are other topologies where this does not work as
      well. There are trees with "H" topologies in the wild, where there are 2
      or more switches with DSA links between them, but every switch has its
      dedicated CPU port. For these topologies, it seems stupid for the neighbor
      switches to install an MDB entry on the routing port, since these
      multicast addresses are fundamentally different than the usual ones we
      support (and that is the justification for this patch, to introduce the
      concept of a termination plane multicast MAC address, as opposed to a
      forwarding plane multicast MAC address).
      
      For example, when a SWITCHDEV_OBJ_ID_HOST_MDB would get added to sw0p0,
      without this patch, it would get treated as a regular port MDB on sw0p2
      and it would match on the ports below (including the sw1p3 routing port).
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ]      [  dsa  ] [  cpu  ] [  user ] [  user ]
       [       ] [       ] [   x   ] [       ] ---- [   x   ] [       ] [       ] [       ]
      
      With the patch, the host MDB notifier on sw0p0 matches only on the local
      switch, which is what we want for a termination plane address.
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ]      [  dsa  ] [  cpu  ] [  user ] [  user ]
       [       ] [       ] [   x   ] [       ] ---- [       ] [       ] [       ] [       ]
      
      Name this new matching function "dsa_switch_host_address_match" since we
      will be reusing it soon for host FDB entries as well.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 29 Jun, 2021 1 commit
    • net: dsa: replay a deletion of switchdev objects for ports leaving a bridged LAG · 74918945
      Vladimir Oltean committed
      When a DSA switch port leaves a bonding interface that is under a
      bridge, there might be dangling switchdev objects on that port left
      behind, because the bridge is not aware that its lower interface (the
      bond) changed state in any way.
      
      Call the bridge replay helpers with adding=false before changing
      dp->bridge_dev to NULL, because we need to simulate to
      dsa_slave_port_obj_del() that these notifications were emitted by the
      bridge.
      
      We add this hook to the NETDEV_PRECHANGEUPPER event handler, because
      we are calling into switchdev (and the __switchdev_handle_port_obj_del
      fanout helpers expect the upper/lower adjacency lists to still be valid)
      and PRECHANGEUPPER is the last moment in time when they still are.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 22 Jun, 2021 1 commit
    • net: dsa: targeted MTU notifiers should only match on one port · 88faba20
      Vladimir Oltean committed
      dsa_slave_change_mtu() calls dsa_port_mtu_change() twice:
      - it sends a cross-chip notifier with the MTU of the CPU port which is
        used to update the DSA links.
      - it sends one targeted MTU notifier which is supposed to only match the
        user port on which we are changing the MTU. The "propagate_upstream"
        variable is used here to bypass the cross-chip notifier system from
        switch.c
      
      But due to a mistake, the second, targeted notifier matches not only on
      the user port, but also on the DSA link which is a member of the same
      switch, if that exists.
      
      And because the DSA links of the entire dst were programmed in a
      previous round to the largest_mtu via a "propagate_upstream == true"
      notification, then the dsa_port_mtu_change(propagate_upstream == false)
      call that is immediately upcoming will break the MTU on the one DSA link
      which is chip-wise local to the dp whose MTU is changing right now.
      
      For example, given this daisy chain topology:
      
         sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
      [  cpu  ] [  user ] [  user ] [  dsa  ] [  user ]
      [   x   ] [       ] [       ] [   x   ] [       ]
                                        |
                                        +---------+
                                                  |
         sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
      [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      [       ] [       ] [       ] [       ] [   x   ]
      
      ip link set sw0p1 mtu 9000
      ip link set sw1p1 mtu 9000 # at this stage, sw0p1 and sw1p1 can talk
                                 # to one another using jumbo frames
      ip link set sw0p2 mtu 1500 # this programs the sw0p3 DSA link first to
                                 # the largest_mtu of 9000, then reprograms it to
                                 # 1500 with the "propagate_upstream == false"
                                 # notifier, breaking communication between
                                 # sw0p1 and sw1p1
      
      To escape from this situation, make the targeted match really match on a
      single port - the user port, and rename the "propagate_upstream"
      variable to "targeted_match" to clarify the intention and avoid future
      issues.
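
A toy restatement of the corrected matching rule, as a standalone sketch rather than the code in switch.c. enum toy_port_type, struct toy_mtu_info and toy_mtu_match() are hypothetical names; the assumption is that a targeted notifier matches exactly the port it names, while a non-targeted one additionally matches all DSA links and CPU ports:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_port_type { TOY_USER, TOY_DSA, TOY_CPU };

/* Hypothetical toy notifier payload. */
struct toy_mtu_info {
	int sw_index;
	int port;
	bool targeted_match;
};

/* Decide whether port "port" of switch "sw_index" reacts to the notifier. */
static bool toy_mtu_match(int sw_index, int port, enum toy_port_type type,
			  const struct toy_mtu_info *info)
{
	/* The port named in the notifier always matches. */
	if (sw_index == info->sw_index && port == info->port)
		return true;

	/* A targeted notifier stops here and never touches other ports. */
	if (info->targeted_match)
		return false;

	/* A non-targeted notifier also updates every DSA link and CPU port. */
	return type == TOY_DSA || type == TOY_CPU;
}
```

In the topology above, a targeted notifier for sw0p2 matches sw0p2 but not the sw0p3 DSA link, so the link keeps the largest_mtu programmed by the earlier non-targeted notifier.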
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 12 Jun, 2021 1 commit
    • net: dsa: generalize overhead for taggers that use both headers and trailers · 4e500251
      Vladimir Oltean committed
      Some really really weird switches just couldn't decide whether to use a
      normal or a tail tagger, so they just did both.
      
      This creates problems for DSA, because we only have the concept of an
      'overhead' which can be applied to the headroom or to the tailroom of
      the skb (like for example during the central TX reallocation procedure),
      depending on the value of bool tail_tag, but not to both.
      
      We need to generalize DSA to cater for these odd switches by
      transforming the 'overhead / tail_tag' pair into 'needed_headroom /
      needed_tailroom'.
      
      The DSA master's MTU is increased to account for both.
      
      The flow dissector code is modified such that it only calls the DSA
      adjustment callback if the tagger has a non-zero header length.
      
      Taggers are trivially modified to declare either needed_headroom or
      needed_tailroom, based on the tail_tag value that they currently
      declare.
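
The conversion can be sketched as a standalone toy model, not the kernel structures. struct toy_tagger and the toy_* functions are hypothetical names, and the byte counts in any usage are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy tagger description in the generalized form. */
struct toy_tagger {
	unsigned int needed_headroom;
	unsigned int needed_tailroom;
};

/* Map a legacy 'overhead / tail_tag' pair onto the new pair of fields. */
static struct toy_tagger toy_from_legacy(unsigned int overhead, bool tail_tag)
{
	struct toy_tagger t = { 0, 0 };

	if (tail_tag)
		t.needed_tailroom = overhead;
	else
		t.needed_headroom = overhead;

	return t;
}

/* The master's MTU must leave room for both the header and the trailer. */
static unsigned int toy_master_mtu(unsigned int mtu,
				   const struct toy_tagger *t)
{
	return mtu + t->needed_headroom + t->needed_tailroom;
}

/* Only a tagger with a non-zero header length needs the dissector callback. */
static bool toy_needs_dissector_fixup(const struct toy_tagger *t)
{
	return t->needed_headroom != 0;
}
```

A tagger that declares both fields, which the legacy pair could not express, simply has both values added to the master's MTU.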
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 24 Mar, 2021 2 commits
  8. 19 Mar, 2021 1 commit
  9. 09 Mar, 2021 1 commit
    • net: dsa: fix switchdev objects on bridge master mistakenly being applied on ports · 03cbb870
      Vladimir Oltean committed
      Tobias reports that after the blamed patch, VLAN objects being added to
      a bridge device are being added to all slave ports instead (swp2, swp3).
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set swp2 master br0
      ip link set swp3 master br0
      bridge vlan add dev br0 vid 100 self
      
      This is because the fix was too broad: we made dsa_port_offloads_netdev
      say "yes, I offload the br0 bridge" for all slave ports, but we didn't
      add the checks whether the switchdev object was in fact meant for the
      physical port or for the bridge itself. So we are reacting on events in
      a way in which we shouldn't.
      
      The reason why the fix was too broad is because the question itself,
      "does this DSA port offload this netdev", was too broad in the first
      place. The solution is to disambiguate the question and separate it into
      two different functions, one to be called for each switchdev attribute /
      object that has an orig_dev == net_bridge (dsa_port_offloads_bridge),
      and the other for orig_dev == net_bridge_port (*_offloads_bridge_port).
      
      In the case of VLAN objects on the bridge interface, this solves the
      problem because we know that VLAN objects are per bridge port and not
      per bridge. And when orig_dev is equal to the net_bridge, we offload it
      as a bridge, but not as a bridge port; that's how we are able to skip
      reacting on those events. Note that this is compatible with future plans
      to have explicit offloading of VLAN objects on the bridge interface as a
      bridge port (in DSA, this signifies that we should add that VLAN towards
      the CPU port).
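
The split into two questions can be modelled roughly as follows. This is a simplified Python sketch with hypothetical names: it ignores LAG uppers and treats netdevs as plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    name: str                          # the port's own netdev, e.g. "swp2"
    bridge_dev: Optional[str] = None   # bridge the port has joined, if any


def offloads_bridge_port(dp, orig_dev):
    """True when the event targets this port as a bridge port."""
    return orig_dev == dp.name


def offloads_bridge(dp, orig_dev):
    """True when the event targets the bridge device this port is under."""
    return dp.bridge_dev is not None and orig_dev == dp.bridge_dev


def should_offload_vlan(dp, orig_dev):
    # VLAN objects are per bridge port, so ask the bridge port question.
    return offloads_bridge_port(dp, orig_dev)


def should_offload_vlan_filtering(dp, orig_dev):
    # The bridge emits this attribute with orig_dev set to the bridge itself.
    return offloads_bridge(dp, orig_dev)
```

In this model, "bridge vlan add dev br0 vid 100 self" arrives with orig_dev == "br0" and is skipped by swp2 and swp3, while the VLAN filtering attribute on br0 is still acted upon.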
      
      Fixes: 99b8202b ("net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored")
      Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
      Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03cbb870
  10. 17 Feb, 2021 1 commit
  11. 15 Feb, 2021 2 commits
  12. 13 Feb, 2021 2 commits
    • V
      net: dsa: act as passthrough for bridge port flags · a8b659e7
      Vladimir Oltean committed
      There are multiple ways in which a PORT_BRIDGE_FLAGS attribute can be
      expressed by the bridge through switchdev, and not all of them can be
      emulated by DSA mid-layer API at the same time.
      
      One possible configuration is when the bridge offloads the port flags
      using a mask that has a single bit set - therefore only one feature
      should change. However, DSA currently groups together unicast and
      multicast flooding in the .port_egress_floods method, which limits our
      options when we try to add support for turning off broadcast flooding:
      do we extend .port_egress_floods with a third parameter which b53 and
      mv88e6xxx will ignore? But that means that the DSA layer, which
      currently implements the PRE_BRIDGE_FLAGS attribute all by itself, will
      see that .port_egress_floods is implemented, and will report that all 3
      types of flooding are supported - not necessarily true.
      
      Another configuration is when the user specifies more than one flag at
      the same time, in the same netlink message. If we were to create one
      individual function per offloadable bridge port flag, we would limit the
      expressiveness of the switch driver of refusing certain combinations of
      flag values. For example, a switch may not have an explicit knob for
      flooding of unknown multicast, just for flooding in general. In that
      case, the only correct thing to do is to allow changes to BR_FLOOD and
      BR_MCAST_FLOOD in tandem, and never allow mismatched values. But having
      a separate .port_set_unicast_flood and .port_set_multicast_flood would
      not allow the driver to possibly reject that.
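
As an illustration of why the driver needs to see the whole request, here is a Python sketch of a check for a switch with a single flooding knob. The bit values and the function name are assumptions made for the example; they are not the kernel's definitions.

```python
import errno

# Illustrative bit positions only, not the kernel's values.
BR_FLOOD = 1 << 0
BR_MCAST_FLOOD = 1 << 1
FLOOD_BITS = BR_FLOOD | BR_MCAST_FLOOD


def pre_bridge_flags(flags, mask):
    """Accept flood changes only when both flags move together.

    Returns 0 when the request can be offloaded, or a negative errno.
    """
    touched = mask & FLOOD_BITS
    if touched == 0:
        return 0                  # no flood flag is being changed
    if touched != FLOOD_BITS:
        return -errno.EINVAL      # only one of the two is being changed
    if (flags & FLOOD_BITS) not in (0, FLOOD_BITS):
        return -errno.EINVAL      # mismatched values requested
    return 0
```

Two separate per-flag callbacks would each see only half of this request, so neither could refuse the mismatched combination.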
      
      Also, DSA doesn't consider it necessary to inform the driver that a
      SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute was offloaded, because it
      just calls .port_egress_floods for the CPU port. When we'll add support
      for the plain SWITCHDEV_ATTR_ID_PORT_MROUTER, that will become a real
      problem because the flood settings will need to be held statefully in
      the DSA middle layer, otherwise changing the mrouter port attribute will
      impact the flooding attribute. And that's _assuming_ that the underlying
      hardware doesn't have anything else to do when a multicast router
      attaches to a port than flood unknown traffic to it.  If it does, there
      will need to be a dedicated .port_set_mrouter anyway.
      
      So we need to let the DSA drivers see the exact form that the bridge
      passes this switchdev attribute in, otherwise we are standing in the
      way. Therefore we also need to use this form of language when
      communicating to the driver that it needs to configure its initial
      (before bridge join) and final (after bridge leave) port flags.
      
      The b53 and mv88e6xxx drivers are converted to the passthrough API and
      their implementation of .port_egress_floods is split into two: a
      function that configures unicast flooding and another for multicast.
      The mv88e6xxx implementation is quite hairy, and it turns out that
      the implementations of unknown unicast flooding are actually the same
      for 6185 and for 6352:
      
      behind the confusing names actually lie two individual bits:
      NO_UNKNOWN_MC -> FLOOD_UC = 0x4 = BIT(2)
      NO_UNKNOWN_UC -> FLOOD_MC = 0x8 = BIT(3)
      
      so there was no reason to entangle them in the first place.
      
      Whereas the 6185 writes to MV88E6185_PORT_CTL0_FORWARD_UNKNOWN of
      PORT_CTL0, which has the exact same bit index. I have left the
      implementations separate though, for the only reason that the names are
      different enough to confuse me, since I am not able to double-check with
      a user manual. The multicast flooding setting for 6185 is in a different
      register than for 6352 though.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8b659e7
    • V
      net: switchdev: pass flags and mask to both {PRE_,}BRIDGE_FLAGS attributes · e18f4c18
      Vladimir Oltean committed
      This switchdev attribute offers a counterproductive API for a driver
      writer, because although br_switchdev_set_port_flag gets passed a
      "flags" and a "mask", those are passed piecemeal to the driver, so while
      the PRE_BRIDGE_FLAGS listener knows what changed because it has the
      "mask", the BRIDGE_FLAGS listener doesn't, because it only has the final
      value. But certain drivers can offload only certain combinations of
      settings, like for example they cannot change unicast flooding
      independently of multicast flooding - they must be both on or both off.
      The way the information is passed to switchdev makes drivers not
      expressive enough, and unable to reject this request ahead of time, in
      the PRE_BRIDGE_FLAGS notifier, so they are forced to reject it during
      the deferred BRIDGE_FLAGS attribute, where the rejection is currently
      ignored.
      
      This patch also changes drivers to make use of the "mask" field for edge
      detection when possible.
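
Edge detection with the "mask" field can be pictured like this. It is a Python sketch with illustrative bit values and hypothetical names, not a copy of any driver:

```python
# Illustrative bit positions only, not the kernel's values.
BR_LEARNING = 1 << 0
BR_FLOOD = 1 << 1
BR_MCAST_FLOOD = 1 << 2

NAMES = {
    "learning": BR_LEARNING,
    "unicast_flood": BR_FLOOD,
    "multicast_flood": BR_MCAST_FLOOD,
}


def changed_settings(flags, mask):
    """Return {name: new_value} for only the flags that the mask marks as changed."""
    return {name: bool(flags & bit) for name, bit in NAMES.items() if mask & bit}
```

A listener that receives only the final "flags" value would have to reprogram every setting; with the mask it touches just the ones that changed.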
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e18f4c18
  13. 12 Feb, 2021 1 commit
  14. 07 Feb, 2021 1 commit
    • V
      net: dsa: make assisted_learning_on_cpu_port bypass offloaded LAG interfaces · a324d3d4
      Vladimir Oltean committed
      Given the following topology, and focusing only on Box A:
      
               Box A
               +----------------------------------+
               | Board 1         br0              |
               |             +---------+          |
               |            /           \         |
               |            |           |         |
               |            |         bond0       |
               |            |        +-----+      |
               |192.168.1.1 |       /       \     |
               |  eno0     swp0    swp1    swp2   |
               +---|--------|-------|-------|-----+
                   |        |       |       |
                   +--------+       |       |
                     Cable          |       |
                               Cable|       |Cable
                     Cable          |       |
                   +--------+       |       |
                   |        |       |       |
               +---|--------|-------|-------|-----+
               |  eno0     swp0    swp1    swp2   |
               |192.168.1.2 |       \       /     |
               |            |        +-----+      |
               |            |         bond0       |
               |            |           |         |
               |            \           /         |
               |             +---------+          |
               | Board 2         br0              |
               +----------------------------------+
               Box B
      
      The assisted_learning_on_cpu_port logic will see that swp0 is bridged
      with a "foreign interface" (bond0) and will therefore install all
      addresses learnt by the software bridge towards bond0 (including the
      address of eno0 on Box B) as static addresses towards the CPU port.
      
      But that's not what we want - bond0 is not really a "foreign interface"
      but one we can offload including L2 forwarding from/towards it. So we
      need to refine our logic for assisted learning such that, whenever we
      see an address learnt on a non-DSA interface, we search through the tree
      for any port that offloads that non-DSA interface.
      
      Some confusion might arise as to why we search through the whole tree
      instead of just the local switch returned by dsa_slave_dev_lower_find.
      Or a different angle of the same confusion: why does
      dsa_slave_dev_lower_find(br_dev) return a single dp that's under br_dev
      instead of the whole list of bridged DSA ports?
      
      To answer the second question, it should be enough to install the static
      FDB entry on the CPU port of a single switch in the tree, because
      dsa_port_fdb_add uses DSA_NOTIFIER_FDB_ADD which ensures that all other
      switches in the tree get notified of that address, and add the entry
      themselves using dsa_towards_port().
      
      This should help understand the answer to the first question: the port
      returned by dsa_slave_dev_lower_find may not be on the same switch as
      the ports that offload the LAG. Nonetheless, if the driver implements
      .crosschip_lag_join and .crosschip_bridge_join as mv88e6xxx does, there
      still isn't any reason for trapping addresses learnt on the remote LAG
      towards the CPU, and we should prevent that.
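
The refined decision can be sketched in Python as below. The data model is hypothetical (ports as small records, netdevs as strings) and only mirrors the reasoning above, not the actual tree walk in DSA:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    switch: int                     # index of the switch in the tree
    name: str                       # the port's netdev
    lag_dev: Optional[str] = None   # LAG this port offloads, if any


def tree_offloads_netdev(tree_ports, dev):
    """Search every port of every switch in the tree, not just one switch."""
    return any(p.name == dev or p.lag_dev == dev for p in tree_ports)


def should_trap_to_cpu(tree_ports, learnt_on):
    """Install the address towards the CPU only for truly foreign interfaces."""
    return not tree_offloads_netdev(tree_ports, learnt_on)
```

Because the search covers the whole tree, an address learnt on bond0 is left alone even when the ports offloading bond0 sit on a different switch than swp0.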
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a324d3d4
  15. 04 Feb, 2021 1 commit
    • V
      net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored · 99b8202b
      Vladimir Oltean committed
      The bridge emits VLAN filtering events and quite a few others via
      switchdev with orig_dev = br->dev. After the blamed commit, these events
      started getting ignored.
      
      The point of the patch was to not offload switchdev objects for ports
      that didn't go through dsa_port_bridge_join, because the configuration
      is unsupported:
      - ports that offload a bonding/team interface go through
        dsa_port_bridge_join when that bonding/team interface is later bridged
        with another switch port or LAG
      - ports that don't offload LAG don't get notified of the bridge that is
        on top of that LAG.
      
      Sadly, a check is missing, which is that the orig_dev is equal to the
      bridge device. This check is compatible with the original intention,
      because ports that don't offload bridging because they use a software
      LAG don't have dp->bridge_dev set.
      
      On a semi-related note, we should not offload switchdev objects or
      populate dp->bridge_dev if the driver doesn't implement .port_bridge_join
      either. However there is no regression associated with that, so it can
      be done separately.
      
      Fixes: 5696c8ae ("net: dsa: Don't offload port attributes on standalone ports")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
      Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
      Link: https://lore.kernel.org/r/20210202233109.1591466-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      99b8202b
  16. 30 Jan, 2021 2 commits
    • V
      net: dsa: allow changing the tag protocol via the "tagging" device attribute · 53da0eba
      Vladimir Oltean committed
      Currently DSA exposes the following sysfs:
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      
      which is a read-only device attribute, introduced in the kernel as
      commit 98cdb480 ("net: dsa: Expose tagging protocol to user-space"),
      and used by libpcap since its commit 993db3800d7d ("Add support for DSA
      link-layer types").
      
      It would be nice if we could extend this device attribute by making it
      writable:
      $ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
      
      This is useful with DSA switches that can make use of more than one
      tagging protocol. It may be useful in dsa_loop in the future too, to
      perform offline testing of various taggers, or for changing between dsa
      and edsa on Marvell switches, if that is desirable.
      
      In terms of implementation, drivers can support this feature by
      implementing .change_tag_protocol, which should always leave the switch
      in a consistent state: either with the new protocol if things went well,
      or with the old one if something failed. Teardown of the old protocol,
      if necessary, must be handled by the driver.
      
      Some things remain as before:
      - The .get_tag_protocol is currently only called at probe time, to load
        the initial tagging protocol driver. Nonetheless, new drivers should
        report the tagging protocol in current use now.
      - The driver should manage by itself the initial setup of tagging
        protocol, no later than the .setup() method, as well as destroying
        resources used by the last tagger in use, no earlier than the
        .teardown() method.
      
      For multi-switch DSA trees, error handling is a bit more complicated,
      since e.g. the 5th out of 7 switches may fail to change the tag
      protocol. When that happens, a revert to the original tag protocol is
      attempted, but that may fail too, leaving the tree in an inconsistent
      state despite each individual switch implementing .change_tag_protocol
      transactionally. Since the intersection between drivers that implement
      .change_tag_protocol and drivers that support D in DSA is currently the
      empty set, the possibility for this error to happen is ignored for now.
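
The multi-switch failure mode can be reproduced with a small Python model. Everything here is hypothetical scaffolding (the Switch class, the fail_on set); it only illustrates the change-then-revert sequence described above:

```python
class Switch:
    """Toy switch whose change_tag_protocol is transactional per switch."""

    def __init__(self, name, proto, fail_on=()):
        self.name = name
        self.proto = proto
        self.fail_on = set(fail_on)   # protocols this switch refuses

    def change_tag_protocol(self, proto):
        if proto in self.fail_on:
            return False              # switch stays on its current protocol
        self.proto = proto
        return True


def tree_change_tag_protocol(switches, new_proto):
    """Return (changed, consistent) after trying to move the whole tree."""
    old = [sw.proto for sw in switches]
    for i, sw in enumerate(switches):
        if sw.change_tag_protocol(new_proto):
            continue
        # Revert the switches that were already changed, on a best-effort basis.
        consistent = True
        for j in range(i):
            if not switches[j].change_tag_protocol(old[j]):
                consistent = False
        return False, consistent
    return True, True
```

If the 5th of 7 switches refuses the new protocol, the first four are reverted; if one of those reverts also fails, the tree ends up split between two protocols even though every switch behaved transactionally on its own.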
      
      Testing:
      
      $ insmod mscc_felix.ko
      [   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
      [   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
      $ insmod tag_ocelot.ko
      $ rmmod mscc_felix.ko
      $ insmod mscc_felix.ko
      [   97.261724] libphy: VSC9959 internal MDIO bus: probed
      [   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
      [   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
      [   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
      [   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
      [   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
      [   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
      [   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
      [   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
      [   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
      [   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
      [   98.527948] device eno2 entered promiscuous mode
      [   98.544755] DSA: tree 0 setup
      
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
      ^C
      --- 10.0.0.1 ping statistics ---
      2 packets transmitted, 2 packets received, 0% packet loss
      round-trip min/avg/max = 0.754/1.545/2.337 ms
      
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      $ cat ./test_ocelot_8021q.sh
              #!/bin/bash
      
              ip link set swp0 down
              ip link set swp1 down
              ip link set swp2 down
              ip link set swp3 down
              ip link set swp5 down
              ip link set eno2 down
              echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
              ip link set eno2 up
              ip link set swp0 up
              ip link set swp1 up
              ip link set swp2 up
              ip link set swp3 up
              ip link set swp5 up
      $ ./test_ocelot_8021q.sh
      ./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
      $ rmmod tag_ocelot.ko
      rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
      $ insmod tag_ocelot_8021q.ko
      $ ./test_ocelot_8021q.sh
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot-8021q
      $ rmmod tag_ocelot.ko
      $ rmmod tag_ocelot_8021q.ko
      rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
      64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
      $ rmmod mscc_felix.ko
      [  645.544426] mscc_felix 0000:00:00.5: Link is Down
      [  645.838608] DSA: tree 0 torn down
      $ rmmod tag_ocelot_8021q.ko
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      53da0eba
    • V
      net: dsa: document the existing switch tree notifiers and add a new one · 886f8e26
      Vladimir Oltean committed
      The existence of dsa_broadcast has generated some confusion in the past:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html
      
      So let's document the existing dsa_port_notify and dsa_broadcast
      functions and explain when each of them should be used.
      
      Also, in fact, the in-between function has always been there but was
      lacking a name, and is the main reason for this patch: dsa_tree_notify.
      Refactor dsa_broadcast to use it.
      
      This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
      where it really belonged in the first place, but had no companion so it
      stood with dsa_port_notify.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      886f8e26
  17. 15 Jan, 2021 1 commit
    • T
      net: dsa: Link aggregation support · 058102a6
      Tobias Waldekranz committed
      Monitor the following events and notify the driver when:
      
      - A DSA port joins/leaves a LAG.
      - A LAG, made up of DSA ports, joins/leaves a bridge.
      - A DSA port in a LAG is enabled/disabled (enabled meaning
        "distributing" in 802.3ad LACP terms).
      
      When a LAG joins a bridge, the DSA subsystem will treat that as each
      individual port joining the bridge. The driver may look at the port's
      LAG device pointer to see if it is associated with any LAG, if that is
      required. This is analogous to how switchdev events are replicated out
      to all lower devices when reaching e.g. a LAG.
      
      Drivers can optionally request that DSA maintain a linear mapping from
      a LAG ID to the corresponding netdev by setting ds->num_lag_ids to the
      desired size.
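
A linear LAG ID table of this kind might look like the following Python sketch. The class is hypothetical, and returning None when the table is full is an assumption of the example, not something the text above specifies:

```python
class LagTable:
    """Fixed-size linear mapping from LAG ID to LAG netdev name."""

    def __init__(self, num_lag_ids):
        self.slots = [None] * num_lag_ids

    def map(self, lag_dev):
        """Return the ID for lag_dev, allocating the first free slot if needed."""
        if lag_dev in self.slots:
            return self.slots.index(lag_dev)
        for lag_id, dev in enumerate(self.slots):
            if dev is None:
                self.slots[lag_id] = lag_dev
                return lag_id
        return None  # assumption: no free ID left for this LAG

    def unmap(self, lag_dev):
        """Release the slot once the last port has left the LAG."""
        if lag_dev in self.slots:
            self.slots[self.slots.index(lag_dev)] = None

    def dev(self, lag_id):
        """Look up the netdev for a LAG ID, e.g. one reported by hardware."""
        return self.slots[lag_id]
```

The table size plays the role of ds->num_lag_ids: the driver picks it to match how many LAGs the hardware can address.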
      
      In the event that the hardware is not capable of offloading a
      particular LAG for any reason (the typical case being use of exotic
      modes like broadcast), DSA will take a hands-off approach, allowing
      the LAG to be formed as a pure software construct. This is reported
      back through the extended ACK, but is otherwise transparent to the
      user.
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Tested-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      058102a6
  18. 12 Jan, 2021 3 commits
    • V
      net: dsa: remove the transactional logic from ageing time notifiers · 77b61365
      Vladimir Oltean committed
      Remove the shim introduced in DSA for offloading the bridge ageing time
      from switchdev, by first checking whether the ageing time is within the
      range limits requested by the driver.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      77b61365
    • V
      net: switchdev: remove the transaction structure from port attributes · bae33f2b
      Vladimir Oltean committed
      Since the introduction of the switchdev API, port attributes were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      attribute notifier structures, and converts drivers to not look at this
      member.
      
      In part, this patch contains a revert of my previous commit 2e554a7a
      ("net: dsa: propagate switchdev vlan_filtering prepare phase to
      drivers").
      
      For the most part, the conversion was trivial except for:
      - Rocker's world implementation based on Broadcom OF-DPA had an odd
        implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
        done mechanically, by pasting the implementation twice, then only
        keeping the code that would get executed during prepare phase on top,
        then only keeping the code that gets executed during the commit phase
        on bottom, then simplifying the resulting code until this was obtained.
      - DSA's offloading of STP state, bridge flags, VLAN filtering and
        multicast router could be converted right away. But the ageing time
        could not, so a shim was introduced and this was left for a further
        commit.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bae33f2b
    • V
      net: switchdev: remove the transaction structure from port object notifiers · ffb68fc5
      Vladimir Oltean committed
      Since the introduction of the switchdev API, port objects were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      object notifier structures, and converts drivers to not look at this
      member.
      
      Where driver conversion is trivial (like in the case of the Marvell
      Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
      done in this patch.
      
      Where driver conversion needs more attention (DSA, Mellanox Spectrum),
      the conversion is left for subsequent patches and here we only fake the
      prepare/commit phases at a lower level, just not in the switchdev
      notifier itself.
      
      Where the code has a natural structure that is best left alone as a
      preparation and a commit phase (as in the case of the Ocelot switch),
      that structure is left in place, just made to not depend upon the
      switchdev transactional model.
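
Keeping a prepare/commit split inside the driver, behind a single entry point, can be sketched as follows. This is a generic Python illustration with hypothetical names and limits, not the Ocelot code:

```python
import errno


class Driver:
    """Toy driver that validates first and applies second, on its own terms."""

    def __init__(self, max_vlans):
        self.max_vlans = max_vlans
        self.vlans = set()

    def _vlan_prepare(self, vid):
        """Check everything that can be checked without touching state."""
        if not 1 <= vid <= 4094:
            return -errno.EINVAL
        if vid not in self.vlans and len(self.vlans) >= self.max_vlans:
            return -errno.ENOSPC
        return 0

    def _vlan_commit(self, vid):
        """Apply the change; in real hardware this step may still fail."""
        self.vlans.add(vid)

    def port_obj_add_vlan(self, vid):
        """Single entry point: no transaction argument is passed in."""
        err = self._vlan_prepare(vid)
        if err:
            return err
        self._vlan_commit(vid)
        return 0
```

The caller sees one call and one return code; whether that call is internally split into two phases is the driver's own business.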
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ffb68fc5
  19. 10 Jan, 2021 1 commit
  20. 08 Jan, 2021 2 commits
  21. 10 Nov, 2020 1 commit
  22. 03 Oct, 2020 2 commits
  23. 24 Sep, 2020 1 commit
    • V
      net: dsa: untag the bridge pvid from rx skbs · 412a1526
      Vladimir Oltean committed
      Currently the bridge untags VLANs present in its VLAN groups in
      __allowed_ingress() only when VLAN filtering is enabled.
      
      But when a skb is seen on the RX path as tagged with the bridge's pvid,
      and that bridge has vlan_filtering=0, and there isn't any 8021q upper
      with that VLAN either, then we have a problem. The bridge will not untag
      it (since it is supposed to remain VLAN-unaware), and pvid-tagged
      communication will be broken.
      
      There are 2 situations where we can end up like that:
      
      1. When installing a pvid in egress-tagged mode, like this:
      
      ip link add dev br0 type bridge vlan_filtering 0
      ip link set swp0 master br0
      bridge vlan del dev swp0 vid 1
      bridge vlan add dev swp0 vid 1 pvid
      
      This happens because DSA configures the VLAN membership of the CPU port
      using the same flags as swp0 (in this case "pvid and not untagged"), in
      an attempt to copy the frame as-is from ingress to the CPU.
      
      However, in this case, the packet may arrive untagged on ingress, it
      will be pvid-tagged by the ingress port, and will be sent as
      egress-tagged towards the CPU. Otherwise stated, the CPU will see a VLAN
      tag where there was none to speak of on ingress.
      
      When vlan_filtering is 1, this is not a problem, as stated in the first
      paragraph, because __allowed_ingress() will pop it. But currently, when
      vlan_filtering is 0 and we have such a VLAN configuration, we need an
      8021q upper (br0.1) to be able to ping over that VLAN, which is not
      symmetrical with the vlan_filtering=1 case, and therefore, confusing for
      users.
      
      Basically what DSA attempts to do is simply an approximation: try to
      copy the skb with (or without) the same VLAN all the way up to the CPU.
      But DSA drivers treat CPU port VLAN membership in various ways (which is
      a good segue into situation 2). And some of those drivers simply tell
      the CPU port to copy the frame unmodified, which is the gold standard
      when it comes to VLAN processing (therefore, any driver which can
      configure the hardware to do that, should do that, and discard the VLAN
      flags requested by DSA on the CPU port).
      
      2. Some DSA drivers always configure the CPU port as egress-tagged, in
      an attempt to recover the classified VLAN from the skb. These drivers
      cannot work at all with untagged traffic when bridged in
      vlan_filtering=0 mode. And they can't go for the easy "just keep the
      pvid as egress-untagged towards the CPU" route, because each front port
      can have its own pvid, and that might require conflicting VLAN
      membership settings on the CPU port (swp1 is pvid for VID 1 and
      egress-tagged for VID 2; swp2 is egress-tagged for VID 1 and pvid for
      VID 2; with this simplistic approach, the CPU port, which is really a
      separate hardware entity and has its own VLAN membership settings, would
      end up being egress-untagged in both VID 1 and VID 2, therefore losing
      the VLAN tags of ingress traffic).
      
      So the only thing we can do is to create a helper function for resolving
      the problematic case (that is, a function which untags the bridge pvid
      when that is in vlan_filtering=0 mode), which taggers in need should
      call. It isn't called from the generic DSA receive path because there
      are drivers that fall neither in the first nor second category.
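
The conditions under which the helper strips the tag can be written down as a small Python model. The types and field names are hypothetical; the real helper operates on an skb and the bridge's VLAN state:

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Skb:
    vid: Optional[int]   # VLAN ID carried by the frame, None if untagged


@dataclass
class BridgePort:
    vlan_filtering: bool                 # state of the bridge above the port
    pvid: int                            # pvid of this bridge port
    upper_8021q_vids: Set[int] = field(default_factory=set)


def untag_bridge_pvid(skb, port):
    """Strip the tag only in the problematic case described above."""
    if skb.vid is None:
        return skb                       # nothing to untag
    if port.vlan_filtering:
        return skb                       # the bridge will pop it by itself
    if skb.vid != port.pvid:
        return skb                       # not the pvid, leave it alone
    if skb.vid in port.upper_8021q_vids:
        return skb                       # an 8021q upper wants this tag
    return Skb(vid=None)
```

Taggers that need this behaviour call the helper on receive; the others never call it, which is why it is not part of the generic receive path.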
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      412a1526
  24. 12 Sep, 2020 1 commit
  25. 06 Jul, 2020 1 commit
  26. 13 May, 2020 1 commit
    • R
      net: dsa: provide an option for drivers to always receive bridge VLANs · 54a0ed0d
      Russell King committed
      DSA assumes that a bridge which has vlan filtering disabled is not
      vlan aware, and ignores all vlan configuration. However, the kernel
      software bridge code allows configuration in this state.
      
      This causes the kernel's idea of the bridge vlan state and the
      hardware state to disagree, so "bridge vlan show" indicates a correct
      configuration but the hardware lacks all configuration. Even worse,
      enabling vlan filtering on a DSA bridge immediately blocks all traffic
      which, given the output of "bridge vlan show", is very confusing.
      
      Provide an option that drivers can set to indicate they want to receive
      vlan configuration even when vlan filtering is disabled. At the very
      least, this is safe for Marvell DSA bridges, which do not look up
      ingress traffic in the VTU if the port is in 8021Q disabled state. It is
      also safe for the Ocelot switch family. Whether this change is suitable
      for all DSA bridges is not known.
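
The decision this option changes can be summarized as a truth table. The sketch below is a simplified model; the function and parameter names are chosen for this example and are not taken from the commit message.

```python
def skip_vlan_configuration(bridge_vlan_filtering, always_receive_vlans):
    """Return True when bridge VLAN events should NOT reach the driver.

    bridge_vlan_filtering: current vlan_filtering setting of the bridge.
    always_receive_vlans:  hypothetical name for the per-driver opt-in.
    """
    if bridge_vlan_filtering:
        # A VLAN-aware bridge always has its configuration offloaded.
        return False
    # VLAN-unaware bridge: skip unless the driver opted in.
    return not always_receive_vlans


for filtering in (False, True):
    for opt_in in (False, True):
        print(filtering, opt_in, skip_vlan_configuration(filtering, opt_in))
```

Only the combination "filtering disabled, driver did not opt in" drops the configuration, which is the case that leaves "bridge vlan show" and the hardware state in disagreement.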
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54a0ed0d
  27. 11 May, 2020 1 commit
    • V
      net: dsa: permit cross-chip bridging between all trees in the system · f66a6a69
      Vladimir Oltean committed
      One way of utilizing DSA is by cascading switches which do not all have
      compatible taggers. Consider the following real-life topology:
      
            +---------------------------------------------------------------+
            | LS1028A                                                       |
            |               +------------------------------+                |
            |               |      DSA master for Felix    |                |
            |               |(internal ENETC port 2: eno2) |                |
            |  +------------+------------------------------+-------------+  |
            |  | Felix embedded L2 switch                                |  |
            |  |                                                         |  |
            |  | +--------------+   +--------------+   +--------------+  |  |
            |  | |DSA master for|   |DSA master for|   |DSA master for|  |  |
            |  | |  SJA1105 1   |   |  SJA1105 2   |   |  SJA1105 3   |  |  |
            |  | |(Felix port 1)|   |(Felix port 2)|   |(Felix port 3)|  |  |
            +--+-+--------------+---+--------------+---+--------------+--+--+
      
      +-----------------------+ +-----------------------+ +-----------------------+
      |   SJA1105 switch 1    | |   SJA1105 switch 2    | |   SJA1105 switch 3    |
      +-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
      |sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
      +-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
      
      The above can be described in the device tree as follows (obviously not
      complete):
      
      mscc_felix {
      	dsa,member = <0 0>;
      	ports {
      		port@4 {
      			ethernet = <&enetc_port2>;
      		};
      	};
      };
      
      sja1105_switch1 {
      	dsa,member = <1 1>;
      	ports {
      		port@4 {
      			ethernet = <&mscc_felix_port1>;
      		};
      	};
      };
      
      sja1105_switch2 {
      	dsa,member = <2 2>;
      	ports {
      		port@4 {
      			ethernet = <&mscc_felix_port2>;
      		};
      	};
      };
      
      sja1105_switch3 {
      	dsa,member = <3 3>;
      	ports {
      		port@4 {
      			ethernet = <&mscc_felix_port3>;
      		};
      	};
      };
      
      Basically we instantiate one DSA switch tree for every hardware switch
      in the system, but we still give them globally unique switch IDs (will
      come back to that later). Having 3 disjoint switch trees makes the
      tagger drivers "just work", because net devices are registered for the
      3 Felix DSA master ports, and they are also DSA slave ports to the ENETC
      port. So packets received on the ENETC port are stripped of their
      stacked DSA tags one by one.
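
The "one by one" stripping can be pictured with a toy model in which a packet is a list of stacked tags followed by the payload. The tag labels and the strip_tag function are assumptions made for this sketch and do not reflect the real tag formats.

```python
def strip_tag(packet, expected_tagger):
    """Pop the outermost tag and return (source_port, remaining_packet)."""
    tagger, port = packet[0]
    if tagger != expected_tagger:
        raise ValueError(f"expected a {expected_tagger} tag, got {tagger}")
    return port, packet[1:]


# Packet as seen by the ENETC port: outer Felix tag, inner sja1105 tag.
packet = [("felix", 1), ("sja1105", 3), "payload"]

# First tree: the Felix tagger delivers the packet to Felix port 1 ...
felix_port, after_felix = strip_tag(packet, "felix")
# ... which is itself a DSA master, so the sja1105 tagger runs next.
sja_port, inner = strip_tag(after_felix, "sja1105")

print(felix_port, sja_port, inner)
```

Each tree only needs to understand its own tag, which is why disjoint trees let the existing taggers work unchanged.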
      
      Currently, hardware bridging between ports on the same sja1105 chip is
      possible, but switching between sja1105 ports on different chips is
      handled by the software bridge. This is fine, but we can do better.
      
      In fact, the dsa_8021q tag used by sja1105 is compatible with cascading.
      In other words, a sja1105 switch can correctly parse and route a packet
      containing a dsa_8021q tag. So if we could enable hardware bridging on
      the Felix DSA master ports, cross-chip bridging could be completely
      offloaded.
      
      Such a system would be used as follows:
      
      ip link add dev br0 type bridge && ip link set dev br0 up
      for port in sw1p0 sw1p1 sw1p2 sw1p3 \
      	    sw2p0 sw2p1 sw2p2 sw2p3 \
      	    sw3p0 sw3p1 sw3p2 sw3p3; do
      	ip link set dev $port master br0
      done
      
      The above makes switching between ports on the same row be performed in
      hardware, and between ports on different rows in software. Now assume
      the Felix switch ports are called swp0, swp1, swp2. By running the
      following extra commands:
      
      ip link add dev br1 type bridge && ip link set dev br1 up
      for port in swp0 swp1 swp2; do
      	ip link set dev $port master br1
      done
      
      the CPU no longer sees packets which traverse sja1105 switch boundaries
      and can be forwarded directly by Felix. The br1 bridge would not be used
      for any sort of traffic termination.
      
      For this to work, we need to give drivers an opportunity to listen for
      bridging events on DSA trees other than their own, and pass that other
      tree index as argument. I have made the assumption, for the moment, that
      the other existing DSA notifiers don't need to be broadcast to other
      trees. That assumption might turn out to be incorrect. But in the
      meantime, introduce a dsa_broadcast function, similar in purpose to
      dsa_port_notify, which is used only by the bridging notifiers.
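
As a rough illustration of the broadcast idea, the following model delivers one bridging event to every tree and carries the originating tree index in the event data. The class, function and field names are assumptions for this sketch; the real notifier chain and its structures differ.

```python
class SwitchTree:
    """Hypothetical stand-in for one DSA switch tree."""

    def __init__(self, index):
        self.index = index
        self.listeners = []

    def notify(self, event, info):
        # Call every listener in this tree; stop at the first error.
        for listener in self.listeners:
            err = listener(event, info)
            if err:
                return err
        return 0


def broadcast(trees, event, info):
    """Deliver the event to all trees, not only the one it originated in."""
    for tree in trees:
        err = tree.notify(event, info)
        if err:
            return err
    return 0


seen = []
trees = [SwitchTree(i) for i in range(4)]  # Felix tree plus three sja1105 trees
for tree in trees:
    # Record (listening tree, event, originating tree) for each delivery.
    tree.listeners.append(
        lambda event, info, me=tree.index: seen.append((me, event, info["tree_index"])) or 0
    )

# A port in tree 1 joins a bridge; every tree hears about it.
result = broadcast(trees, "bridge_join", {"tree_index": 1, "port": 0})
print(result, seen)
```

Passing the originating tree index lets a listener tell its own ports apart from ports of another tree when deciding whether the event is relevant.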
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f66a6a69
  28. 24 Apr, 2020 1 commit
    • A
      net: dsa: add GRO support via gro_cells · e131a563
      Alexander Lobakin committed
      gro_cells lib is used by different encapsulating netdevices, such as
      geneve, macsec, vxlan etc. to speed up decapsulated traffic processing.
      A CPU tag is a sort of "encapsulation", and we can use the same mechanism
      to greatly improve overall DSA performance.
      skbs are passed to the GRO layer after removing CPU tags, so we don't
      need any new packet offload types, as I first proposed in the first
      GRO-over-DSA variant [1].
      
      The size of struct gro_cells is sizeof(void *), so hot struct
      dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
      remain in one 32-byte cacheline.
      The other positive side effect is that drivers for network devices
      that can be shipped as CPU ports of DSA-driven switches can now use
      napi_gro_frags() to pass skbs to the kernel. Packets built that way are
      completely non-linear and are likely to be dropped without GRO.
      
      This was tested on a to-be-mainlined-soon Ethernet driver that uses
      napi_gro_frags(), and the overall performance was on par with the
      variant from [1], sometimes even better due to minimal overhead.
      net.core.gro_normal_batch tuning may help to push it to the limit
      on particular setups and platforms.
      
      iperf3 IPoE VLAN NAT TCP forwarding (port1.218 -> port0) setup
      on 1.2 GHz MIPS board:
      
      5.7-rc2 baseline:
      
      [ID]  Interval         Transfer     Bitrate        Retr
      [ 5]  0.00-120.01 sec  9.00 GBytes  644 Mbits/sec  413  sender
      [ 5]  0.00-120.00 sec  8.99 GBytes  644 Mbits/sec       receiver
      
      Iface      RX packets  TX packets
      eth0       7097731     7097702
      port0      426050      6671829
      port1      6671681     425862
      port1.218  6671677     425851
      
      With this patch:
      
      [ID]  Interval         Transfer     Bitrate        Retr
      [ 5]  0.00-120.01 sec  12.2 GBytes  870 Mbits/sec  122  sender
      [ 5]  0.00-120.00 sec  12.2 GBytes  870 Mbits/sec       receiver
      
      Iface      RX packets  TX packets
      eth0       9474792     9474777
      port0      455200      353288
      port1      9019592     455035
      port1.218  353144      455024
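
For a quick comparison of the two runs, the relative change can be computed from the figures quoted above. The numbers are copied from the iperf3 output; the variable names are just for this example.

```python
# Sender-side figures from the two iperf3 runs above.
baseline = {"mbits_per_sec": 644, "retransmits": 413}
patched = {"mbits_per_sec": 870, "retransmits": 122}

# Relative throughput gain, in percent.
gain_pct = (patched["mbits_per_sec"] / baseline["mbits_per_sec"] - 1) * 100

# Relative reduction in TCP retransmissions, in percent.
retr_drop_pct = (1 - patched["retransmits"] / baseline["retransmits"]) * 100

print(f"throughput gain: {gain_pct:.1f}%")
print(f"retransmit reduction: {retr_drop_pct:.1f}%")
```

This covers a single setup on one board, so the percentages are only indicative.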
      
      v2:
       - Add some performance examples in the commit message;
       - No functional changes.
      
      [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/
      Signed-off-by: Alexander Lobakin <bloodyreaper@yandex.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e131a563