1. 04 Jul, 2022 2 commits
  2. 29 Jun, 2022 9 commits
    • A
      mlxsw: spectrum_switchdev: Convert MDB code to use PGT APIs · e28cd993
      Amit Cohen authored
      The previous patches added common APIs for maintaining the PGT (Port
      Group Table). In the legacy model, software did not interact with this
      table directly. Instead, it was accessed by firmware in response to
      registers such as SFTR and SMID. In the new model, software has full
      control over the PGT table using the SMID register.
      
      The configuration of MDB entries is already done via SMID, so the new
      PGT APIs can also be used with the legacy model. The only difference is
      that the MID index should be aligned to the bridge model. See a previous
      patch which added an API for that.
      
      The main changes are:
      - MDB code does not maintain a bitmap of ports in the MDB entry anymore.
        Instead, it stores a list of ports with additional information.
      - MDB code does not configure the SMID register directly anymore. This
        is done via the PGT API when a port is first added or removed.
      - Today MDB code does not update SMID when a port is added/removed while
        multicast is disabled. Instead, it maintains a bitmap of ports and once
        multicast is enabled, it rewrites the entry to hardware. Using PGT APIs,
        the entry will be updated also when multicast is disabled, but the
        mapping between {MAC, FID}->{MID} will not appear in the SFD register.
        It means that SMID will be updated all the time and disabling/enabling
        multicast will impact only the SFD configuration.
      - For multicast router ports, today only SMID is updated and the bitmap
        is not updated. Using the new list of ports, there is a reference count
        for each port, so it can be saved in software also. For such a port,
        'struct mlxsw_sp_mdb_entry.ports_count' will not be updated and the
        port in the list will be marked as 'mrouter'.
      - Finally, `struct mlxsw_sp_mid.in_hw` is not needed anymore.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: spectrum_switchdev: Flush port from MDB entries according to FID index · 4c3f7442
      Amit Cohen authored
      Currently, flushing a port from all MDB entries is done when the last
      VLAN is removed. This behavior is inaccurate, as a port can be removed
      while another port still uses the same VLAN. In such a case, the removed
      port is not the last one which uses this VLAN, but it is still supposed
      to be removed from the MDB entries.
      
      Flush the port from MDB when it is removed, regardless of the state of
      other ports. Flush only the MDB entries which are relevant for the same
      FID index.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: spectrum_switchdev: Add support for getting and putting MDB entry · 7434ed61
      Amit Cohen authored
      A previous patch added support for init() and fini() for MDB entries. An
      MDB entry can be updated: ports can be added to and removed from the
      entry. Add get() and put() functions. The first one checks if the entry
      already exists and otherwise initializes the entry. The second removes
      the entry only if there are no more ports in this entry.
      
      Use the list of the ports which was added in a previous patch. When the
      list contains only one port which is not a multicast router, and this
      port is removed, the MDB entry can be removed. Use
      'struct mlxsw_sp_mdb_entry.ports_count' to know how many ports use the
      entry, regardless of the use of multicast router ports.
      
      When mlxsw_sp_mc_mdb_entry_put() is called with a specific port which is
      supposed to be removed, check if the removal will cause a deletion of
      the entry. If this is the case, call mlxsw_sp_mc_mdb_entry_fini() which
      first deletes the MDB entry and then releases the PGT entry, to avoid a
      temporary situation in which the MDB entry points to an empty PGT entry,
      as otherwise packets will be temporarily dropped instead of being flooded.
      
      The new functions will be used in the next patches.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
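The put() decision described in this commit can be illustrated with a small userspace sketch. It is not the driver's code: the fixed-size array, the names `mdb_port_get()`, `mdb_mrouter_port_get()` and `mdb_put_frees_entry()`, and the way the mrouter reference is folded into the same counter are all assumptions made for illustration. Only `ports_count` and the 'mrouter' marking come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PORTS 8 /* hypothetical; a fixed array stands in for the list */

struct mdb_port {
	unsigned int refcount; /* 0: port is not part of the entry */
	bool mrouter;          /* one reference belongs to the mrouter role */
};

struct mdb_entry {
	struct mdb_port ports[MAX_PORTS];
	unsigned int ports_count; /* ports that joined the group themselves */
};

/* References held because the port joined the group (mrouter excluded). */
unsigned int mdb_member_refs(const struct mdb_port *p)
{
	return p->refcount - (p->mrouter ? 1u : 0u);
}

void mdb_port_get(struct mdb_entry *e, unsigned int port)
{
	assert(port < MAX_PORTS);
	if (mdb_member_refs(&e->ports[port]) == 0)
		e->ports_count++; /* first membership reference for this port */
	e->ports[port].refcount++;
}

void mdb_port_put(struct mdb_entry *e, unsigned int port)
{
	assert(port < MAX_PORTS && mdb_member_refs(&e->ports[port]) > 0);
	e->ports[port].refcount--;
	if (mdb_member_refs(&e->ports[port]) == 0)
		e->ports_count--; /* last membership reference is gone */
}

void mdb_mrouter_port_get(struct mdb_entry *e, unsigned int port)
{
	assert(port < MAX_PORTS);
	if (e->ports[port].mrouter)
		return; /* the mrouter reference is taken only once */
	e->ports[port].mrouter = true;
	e->ports[port].refcount++;
}

void mdb_mrouter_port_put(struct mdb_entry *e, unsigned int port)
{
	assert(port < MAX_PORTS);
	if (!e->ports[port].mrouter)
		return;
	e->ports[port].mrouter = false;
	e->ports[port].refcount--;
}

/* Would dropping one membership reference of 'port' leave the entry with
 * no group members, so that the whole entry should be destroyed?
 */
bool mdb_put_frees_entry(const struct mdb_entry *e, unsigned int port)
{
	assert(port < MAX_PORTS);
	return e->ports_count == 1 && mdb_member_refs(&e->ports[port]) == 1;
}
```

In this model a multicast router port never increments `ports_count`, so an entry whose only remaining ports are multicast routers is still considered unused.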
    • A
      mlxsw: spectrum_switchdev: Implement mlxsw_sp_mc_mdb_entry_{init, fini}() · ea0f58d6
      Amit Cohen authored
      The next patches will convert MDB code to use PGT APIs. The change will
      move the responsibility of allocating MID indexes and writing PGT
      configurations to hardware to PGT code. As part of this change, most of the
      MDB code will be changed and improved.
      
      As a preparation for the above mentioned change, implement
      mlxsw_sp_mc_mdb_entry_{init, fini}(). Currently, there is a function
      __mlxsw_sp_mc_alloc(), which does more than just allocate a MID. In
      addition, there is no equivalent function to free the MID. When
      mlxsw_sp_port_remove_from_mid() removes the last port, it handles MID
      removal. Instead, add init() and fini() functions, which use PGT APIs.
      
      The differences between the existing and the new functions are as follows:
      1. Today MDB code does not update SMID when port is added/removed while
         multicast is disabled. It maintains a bitmap of ports and once multicast
         is enabled, it writes the entry to hardware. Instead, using PGT APIs,
         the entry will be updated also when multicast is disabled, but the
         mapping between {MAC, FID}->{MID} (which is configured using SFD) will be
         updated according to multicast state. It means that SMID will be updated
         all the time and disable/enable multicast will impact only SFD
         configuration.
      
      2. Today the allocation of MID index is done as part of
         mlxsw_sp_mc_write_mdb_entry(). The fact that the entry will be
         written in hardware all the time, moves the allocation of the index to
         be as part of the MDB entry initialization. PGT API is used for the
         allocation.
      
      3. Today the update of multicast router ports is done as part of
         mlxsw_sp_mc_write_mdb_entry(). Instead, add functions to add/remove
         all multicast router ports when an entry is first added or removed.
         When a multicast router port is later added/removed, the dedicated
         API will be used to add/remove it from the existing entries.
      
      4. A list of ports will be stored per MDB entry instead of the existing
         bitmap. The list will contain the multicast router ports and maintain
         a reference counter per port.
      
      Add mlxsw_sp_mdb_entry_write() which is almost identical to
      mlxsw_sp_port_mdb_op(). Use a clearer name and align the MID index to
      the bridge model using the PGT API. The existing function will be removed
      in the next patches.
      
      Note that PGT APIs configure the firmware using SMID register, like the
      driver already does today for MDB entries, so PGT APIs can be used also
      using legacy bridge model.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: spectrum_switchdev: Add support for maintaining list of ports per MDB entry · d2994e13
      Amit Cohen authored
      As part of converting MDB code to use PGT APIs, PGT code stores which ports
      are mapped to each PGT entry. PGT code is not aware of the type of the port
      (multicast router or not), as it is not relevant there.
      
      To be able to release an MDB entry when there are no ports which are
      not multicast routers, the entry should be aware of the state of its
      ports. Add support for maintaining a list of ports per MDB entry.
      
      Each port will hold a reference count as multiple MDB entries can use the
      same hardware MDB entry. This occurs because MDB entries in the Linux
      bridge are keyed according to their multicast IP. When these entries are
      notified to device drivers via switchdev, the multicast IP is converted
      to a multicast MAC. This conversion might cause collisions. For example,
      ff0e::1 and ff0e:1234::1 are both mapped to the multicast MAC
      33:33:00:00:00:01.
      
      A multicast router port will take a reference once, and will be marked as
      'mrouter'. When a port in the list is a multicast router and its
      reference count is one, the entry can be removed if there are no other
      ports which are not multicast routers. For that, maintain a counter per
      MDB entry to count the ports in the list which were added to the
      multicast group, and not because they are multicast routers. When this
      counter is zero, the entry can be removed.
      
      Add mlxsw_sp_mdb_entry_port_{get,put}() for regular ports and
      mlxsw_sp_mdb_entry_mrouter_port_{get,put}() for multicast router ports.
      Call PGT API to add or remove port from PGT entry when port is first added
      or removed, according to the reference counting.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
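The collision mentioned in this commit follows from the standard IPv6 multicast mapping, where the Ethernet address is 33:33 followed by the low 32 bits of the group address. The sketch below is a standalone illustration of that mapping; the function names `ipv6_mc_to_mac()` and `ipv6_mc_mac_collide()` are made up here and are not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a 16-byte IPv6 multicast address to its Ethernet multicast MAC:
 * the fixed prefix 33:33 followed by the last four bytes of the address.
 */
void ipv6_mc_to_mac(const uint8_t ip6[16], uint8_t mac[6])
{
	assert(ip6[0] == 0xff); /* IPv6 multicast addresses start with ff */
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(&mac[2], &ip6[12], 4);
}

/* Return 1 when two IPv6 groups end up on the same multicast MAC. */
int ipv6_mc_mac_collide(const uint8_t a[16], const uint8_t b[16])
{
	uint8_t mac_a[6], mac_b[6];

	ipv6_mc_to_mac(a, mac_a);
	ipv6_mc_to_mac(b, mac_b);
	return memcmp(mac_a, mac_b, sizeof(mac_a)) == 0;
}
```

Because only the last four bytes survive the conversion, ff0e::1 and ff0e:1234::1 share one hardware entry, which is why a port needs a reference count rather than a single bit.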
    • A
      mlxsw: spectrum_switchdev: Add support for maintaining hash table of MDB entries · 5d0512e5
      Amit Cohen authored
      Currently MDB entries are stored in a list as part of
      'struct mlxsw_sp_bridge_device'. Storing them in a hash table in
      addition to the list will allow finding a specific entry more efficiently.
      
      Add support for the required hash table. The next patches will insert
      and remove MDB entries from the table. The existing code which adds and
      removes entries will be removed and replaced by new code in the next
      patches, so there is no point in adjusting the existing code.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: spectrum_switchdev: Save MAC and FID as a key in 'struct mlxsw_sp_mdb_entry' · 0ac98543
      Amit Cohen authored
      The next patch will add support for storing all the MDB entries in a hash
      table. As a preparation, save the MAC address and the FID in a
      separate structure. This structure will be used later as a key for the
      hash table.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
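A minimal sketch of why a separate key structure is convenient for a hash table: a padding-free {MAC, FID} struct can be hashed and compared as one blob. The struct name `mdb_key`, its field names, and the use of FNV-1a are assumptions for illustration only; the commit message does not say which hash table implementation or hash function the driver uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical key layout: MAC and FID kept together in one struct. */
struct mdb_key {
	uint8_t addr[ETH_ALEN];
	uint16_t fid;
};

/* Byte-wise hashing and memcmp() are only safe without padding bytes. */
static_assert(sizeof(struct mdb_key) == 8, "mdb_key must have no padding");

/* FNV-1a over the raw key bytes (a stand-in for the real hash function). */
uint32_t mdb_key_hash(const struct mdb_key *key)
{
	const uint8_t *p = (const uint8_t *)key;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(*key); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

int mdb_key_equal(const struct mdb_key *a, const struct mdb_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two entries with the same MAC but different FIDs compare as different keys, which matches the {MAC, FID} -> MID mapping described in the neighbouring commits.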
    • A
      mlxsw: spectrum_switchdev: Rename MIDs list · eaa0791a
      Amit Cohen authored
      Currently, the list which stores the MDB entries for a given bridge
      instance is called 'mids_list'.
      
      This name is not accurate as a MID entry stores a bitmap of ports to
      which a packet needs to be replicated and an MDB entry stores the mapping
      from {MAC, FID} to PGT index (MID).
      
      Rename it to 'mdb_list'.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: spectrum_switchdev: Rename MID structure · eede53a4
      Amit Cohen authored
      Currently the structure which represents an MDB entry is called
      'struct mlxsw_sp_mid'. This name is not accurate as a MID entry stores a
      bitmap of ports to which a packet needs to be replicated and an MDB entry
      stores the mapping from {MAC, FID} to PGT index (MID).
      
      Rename the structure to 'struct mlxsw_sp_mdb_entry'. The structure
      'mlxsw_sp_mid' is defined as part of spectrum.h. The only file which
      uses it is spectrum_switchdev.c, so there is no reason to expose it to
      other files. Move the definition to spectrum_switchdev.c.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 28 Jun, 2022 1 commit
  4. 22 Jun, 2022 7 commits
  5. 20 Jun, 2022 1 commit
    • A
      mlxsw: Add SMPE related fields to SMID2 register · 894b98d5
      Amit Cohen authored
      SMID register maps multicast ID (MID) into a list of local ports.
      As preparation for unified bridge model, add some required fields for
      future use.
      
      The device includes two main tables to support layer 2 multicast (i.e.,
      MDB and flooding). These are the PGT (Port Group Table) and the
      MPE (Multicast Port Egress) table.
      - PGT is {MID -> (bitmap of local_port, SMPE index)}
      - MPE is {(Local port, SMPE index) -> eVID}
      
      In Spectrum-1, both indexes into the MPE table (local port and SMPE) are
      derived from the PGT table. Therefore, the SMPE index needs to be
      programmed as part of the PGT entry via new fields in SMID - 'smpe_valid'
      and 'smpe'.
      
      Add the two mentioned fields for future use and align the callers of
      mlxsw_reg_smid2_pack() to pass zeros for SMPE fields.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
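The two-table lookup described in this commit can be modelled in a few lines. This is a toy model of the Spectrum-1 case, where both MPE indexes come from the PGT entry; the table sizes, the struct layout and the function name `mc_egress_vid()` are hypothetical and do not reflect the register format.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LOCAL_PORTS 16 /* hypothetical table sizes */
#define PGT_SIZE 8
#define SMPE_SIZE 4

struct pgt_entry {
	uint32_t port_bitmap; /* bit N set: replicate to local port N */
	uint16_t smpe;        /* SMPE index programmed with the entry */
	uint8_t smpe_valid;
};

struct l2_mc_tables {
	struct pgt_entry pgt[PGT_SIZE];           /* MID -> (ports, SMPE) */
	uint16_t mpe[MAX_LOCAL_PORTS][SMPE_SIZE]; /* (port, SMPE) -> eVID */
};

/* Return the egress VID used when a packet hitting 'mid' is replicated to
 * 'local_port', or -1 if the port is not a member or no SMPE is valid.
 */
int mc_egress_vid(const struct l2_mc_tables *t, unsigned int mid,
		  unsigned int local_port)
{
	const struct pgt_entry *e;

	assert(mid < PGT_SIZE && local_port < MAX_LOCAL_PORTS);
	e = &t->pgt[mid];
	if (!(e->port_bitmap & (UINT32_C(1) << local_port)))
		return -1;
	if (!e->smpe_valid)
		return -1;
	assert(e->smpe < SMPE_SIZE);
	return t->mpe[local_port][e->smpe];
}
```

The sketch shows why the SMPE index has to travel with the PGT entry: without it, the second lookup has only the local port and cannot select an eVID.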
  6. 04 May, 2022 1 commit
  7. 17 Feb, 2022 1 commit
  8. 28 Jan, 2022 1 commit
    • A
      mlxsw: spectrum: Guard against invalid local ports · bcdfd615
      Amit Cohen authored
      When processing events generated by the device's firmware, the driver
      protects itself from events reported for non-existent local ports, but
      not for the CPU port (local port 0), which exists, but does not have all
      the fields that other local ports have.
      
      This can result in a NULL pointer dereference when trying to access
      'struct mlxsw_sp_port' fields which are not initialized for the CPU port.
      
      Commit 63b08b1f ("mlxsw: spectrum: Protect driver from buggy firmware")
      already handled such an issue by bailing early when processing a PUDE
      event reported for the CPU port.
      
      Generalize the approach by moving the check to a common function and
      making use of it in all relevant places.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
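A common validity check of the kind this commit describes might look like the sketch below. The struct and function names (`sw`, `local_port_valid()`) are hypothetical, and the upper-bound check is an assumption added for completeness; the commit message itself only mentions non-existent ports and the CPU port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CPU_PORT 0 /* local port 0 is the CPU port */

struct port {
	int id;
};

struct sw {
	unsigned int max_ports;
	struct port **ports; /* ports[i] is NULL if local port i does not exist */
};

/* One helper that every event handler can call before touching per-port
 * fields: reject the CPU port, out-of-range ports and non-existent ports.
 */
bool local_port_valid(const struct sw *sw, unsigned int local_port)
{
	assert(sw != NULL && sw->ports != NULL);
	if (local_port == CPU_PORT || local_port >= sw->max_ports)
		return false;
	return sw->ports[local_port] != NULL;
}
```

An event handler would then bail out early when the helper returns false, instead of repeating the individual checks at each call site.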
  9. 15 Dec, 2021 2 commits
  10. 01 Dec, 2021 2 commits
  11. 25 Oct, 2021 1 commit
  12. 11 Aug, 2021 1 commit
  13. 23 Jul, 2021 1 commit
    • T
      net: bridge: switchdev: allow the TX data plane forwarding to be offloaded · 47211192
      Tobias Waldekranz authored
      Allow switchdevs to forward frames from the CPU in accordance with the
      bridge configuration in the same way as is done between bridge
      ports. This means that the bridge will only send a single skb towards
      one of the ports under the switchdev's control, and expects the driver
      to deliver the packet to all eligible ports in its domain.
      
      Primarily this improves the performance of multicast flows with
      multiple subscribers, as it allows the hardware to perform the frame
      replication.
      
      The basic flow between the driver and the bridge is as follows:
      
      - When joining a bridge port, the switchdev driver calls
        switchdev_bridge_port_offload() with tx_fwd_offload = true.
      
      - The bridge sends offloadable skbs to one of the ports under the
        switchdev's control using skb->offload_fwd_mark = true.
      
      - The switchdev driver checks the skb->offload_fwd_mark field and lets
        its FDB lookup select the destination port mask for this packet.
      
      v1->v2:
      - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
      - introduce a static key "br_switchdev_fwd_offload_used" to minimize the
        impact of the newly introduced feature on all the setups which don't
        have hardware that can make use of it
      - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
        line access
      - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
        __br_forward()
      - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
        is being used
      - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP
      
      v2->v3:
      - replace the solution based on .ndo_dfwd_add_station with a solution
        based on switchdev_bridge_port_offload
      - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD
      v3->v4: rebase
      v4->v5:
      - make sure the static key is decremented on bridge port unoffload
      - more function and variable renaming and comments for them:
        br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload
        br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload
        nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom
        nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload
        fwd_accel to tx_fwd_offload
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 22 Jul, 2021 2 commits
    • V
      net: bridge: move the switchdev object replay helpers to "push" mode · 4e51bf44
      Vladimir Oltean authored
      Starting with commit 4f2673b3 ("net: bridge: add helper to replay
      port and host-joined mdb entries"), DSA has introduced some bridge
      helpers that replay switchdev events (FDB/MDB/VLAN additions and
      deletions) that can be lost by the switchdev drivers in a variety of
      circumstances:
      
      - an IP multicast group was host-joined on the bridge itself before any
        switchdev port joined the bridge, leading to the host MDB entries
        missing in the hardware database.
      - during the bridge creation process, the MAC address of the bridge was
        added to the FDB as an entry pointing towards the bridge device
        itself, but with no switchdev ports being part of the bridge yet, this
        local FDB entry would remain unknown to the switchdev hardware
        database.
      - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
        before any switchdev port joined that LAG, leading to the hardware
        database missing those entries.
      - a switchdev port left a LAG that is a bridge port, while the LAG
        remained part of the bridge, and all FDB/MDB/VLAN entries remained
        installed in the hardware database of the switchdev port.
      
      Also, since commit 0d2cfbd4 ("net: bridge: ignore switchdev events
      for LAG ports which didn't request replay"), DSA introduced a method,
      based on a const void *ctx, to ensure that two switchdev ports under the
      same LAG that is a bridge port do not see the same MDB/VLAN entry being
      replayed twice by the bridge, once for every bridge port that joins the
      LAG.
      
      With so many ordering corner cases being possible, it seems unreasonable
      to expect a switchdev driver writer to get it right from the first try.
      Therefore, now that DSA has experimented with the bridge replay helpers
      for a little bit, we can move the code to the bridge driver where it is
      more readily available to all switchdev drivers.
      
      To convert the switchdev object replay helpers from "pull mode" (where
      the driver asks for them) to a "push mode" (where the bridge offers them
      automatically), the biggest problem is that the bridge needs to be aware
      when a switchdev port joins and leaves, even when the switchdev is only
      indirectly a bridge port (for example when the bridge port is a LAG
      upper of the switchdev).
      
      Luckily, we already have a hook for that, in the form of the newly
      introduced switchdev_bridge_port_offload() and
      switchdev_bridge_port_unoffload() calls. These offer a natural place for
      hooking the object addition and deletion replays.
      
      Extend the above 2 functions with:
      - pointers to the switchdev atomic notifier (for FDB replays) and the
        blocking notifier (for MDB and VLAN replays).
      - the "const void *ctx" argument required for drivers to be able to
        disambiguate between which port is targeted, when multiple ports are
        lowers of the same LAG that is a bridge port. Most of the drivers pass
        NULL to this argument, except the ones that support LAG offload and have
        the proper context check already in place in the switchdev blocking
        notifier handler.
      
      Also unexport the replay helpers, since nobody except the bridge calls
      them directly now.
      
      Note that:
      (a) we abuse the terminology slightly, because FDB entries are not
          "switchdev objects", but we count them as objects nonetheless.
          With no direct way to prove it, I think they are not modeled as
          switchdev objects because those can only be installed by the bridge
          to the hardware (as opposed to FDB entries which can be propagated
          in the other direction too). This is merely an abuse of terms, FDB
          entries are replayed too, despite not being objects.
      (b) the bridge does not attempt to sync port attributes to newly joined
          ports, just the countable stuff (the objects). The reason for this
          is simple: no universal and symmetric way to sync and unsync them is
          known. For example, VLAN filtering: what to do on unsync, disable or
          leave it enabled? Similarly, STP state, ageing timer, etc. What
          a switchdev port does when it becomes standalone again is not really
          up to the bridge's competence, and the driver should deal with it.
          On the other hand, replaying deletions of switchdev objects can be
          seen as a matter of cleanup and therefore be treated by the bridge,
          hence this patch.
      
      We make the replay helpers opt-in for drivers, because they might not
      bring immediate benefits for them:
      
      - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
        so br_vlan_replay() should not do anything for the new drivers on
        which we call it. The existing drivers where there was even a slight
        possibility for there to exist a VLAN on a bridge port before they
        join it are already guarded against this: mlxsw and prestera deny
        joining LAG interfaces that are members of a bridge.
      
      - br_fdb_replay() should now notify of local FDB entries, but I patched
        all drivers except DSA to ignore these new entries in commit
        2c4eca3e ("net: bridge: switchdev: include local flag in FDB
        notifications"). Driver authors can lift this restriction as they
        wish, and when they do, they can also opt into the FDB replay
        functionality.
      
      - br_mdb_replay() should fix a real issue which is described in commit
        4f2673b3 ("net: bridge: add helper to replay port and host-joined
        mdb entries"). However most drivers do not offload the
        SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
        offload this switchdev object, and I don't completely understand the
        way in which they offload this switchdev object anyway. So I'll leave
        it up to these drivers' respective maintainers to opt into
        br_mdb_replay().
      
      So most of the drivers pass NULL notifier blocks for the replay helpers,
      except:
      - dpaa2-switch which was already acked/regression-tested with the
        helpers enabled (and there isn't much of a downside in having them)
      - ocelot which already had replay logic in "pull" mode
      - DSA which already had replay logic in "pull" mode
      
      An important observation is that the drivers which don't currently
      request bridge event replays don't even have the
      switchdev_bridge_port_{offload,unoffload} calls placed in proper places
      right now. This was done to avoid unnecessary rework for drivers which
      might never even add support for this. For driver writers who wish to
      add replay support, this can be used as a tentative placement guide:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
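The role of the "const void *ctx" argument can be sketched as follows. This is an illustrative userspace model, not a real notifier handler: `struct sw_port` and `sw_port_mdb_add()` are hypothetical names, and the convention that a NULL ctx means "not a targeted replay" is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>

struct sw_port {
	int id;
	unsigned int mdb_entries; /* entries installed for this port */
};

/* Handle an MDB addition that reaches 'port'. A non-NULL ctx names the one
 * port a replay is meant for; any other port under the same LAG ignores it.
 * Returns 1 if the event was applied, 0 if it was skipped.
 */
int sw_port_mdb_add(struct sw_port *port, const void *ctx)
{
	assert(port != NULL);
	if (ctx != NULL && ctx != port)
		return 0; /* replay aimed at another lower of the same LAG */
	port->mdb_entries++;
	return 1;
}
```

With two ports under one LAG, a replay triggered by the second port joining is delivered to both handlers, but only the port matching ctx installs the entry, so the first port does not see it twice.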
    • V
      net: bridge: switchdev: let drivers inform which bridge ports are offloaded · 2f5dc00f
      Vladimir Oltean authored
      On reception of an skb, the bridge checks if it was marked as 'already
      forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
      is, it assigns the source hardware domain of that skb based on the
      hardware domain of the ingress port. Then during forwarding, it enforces
      that the egress port must have a different hardware domain than the
      ingress one (this is done in nbp_switchdev_allowed_egress).
      
      Non-switchdev drivers don't report any physical switch id (neither
      through devlink nor .ndo_get_port_parent_id), therefore the bridge
      assigns them a hardware domain of 0, and packets coming from them will
      always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
      
      Problems appear due to the fact that DSA would like to perform software
      fallback for bonding and team interfaces that the physical switch cannot
      offload.
      
             +-- br0 ---+
            / /   |      \
           / /    |       \
          /  |    |      bond0
         /   |    |     /    \
       swp0 swp1 swp2 swp3 swp4
      
      There, it is desirable that the presence of swp3 and swp4 under a
      non-offloaded LAG does not preclude us from doing hardware bridging
      between swp0, swp1 and swp2. The bandwidth of the CPU is often high
      enough that software bridging between {swp0,swp1,swp2} and bond0 is not
      impractical.
      
      But this creates an impossible paradox given the current way in which
      port hardware domains are assigned. When the driver receives a packet
      from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
      something.
      
      - If we set it to 0, then the bridge will forward it towards swp1, swp2
        and bond0. But the switch has already forwarded it towards swp1 and
        swp2 (not to bond0, remember, that isn't offloaded, so as far as the
        switch is concerned, ports swp3 and swp4 are not looking up the FDB,
        and the entire bond0 is a destination that is strictly behind the
        CPU). But we don't want duplicated traffic towards swp1 and swp2, so
        it's not ok to set skb->offload_fwd_mark = 0.
      
      - If we set it to 1, then the bridge will not forward the skb towards
        the ports with the same switchdev mark, i.e. not to swp1, swp2 and
        bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
        have forwarded the skb there.
      
      So the real issue is that bond0 will be assigned the same hardware
      domain as {swp0,swp1,swp2}, because the function that assigns hardware
      domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
      lower interfaces until it finds something that implements devlink (calls
      dev_get_port_parent_id with bool recurse = true). This is a problem
      because the fact that bond0 can be offloaded by swp3 and swp4 in our
      example is merely an assumption.
      
      A solution is to give the bridge explicit hints as to what hardware
      domain it should use for each port.
      
      Currently, the bridging offload is very 'silent': a driver registers a
      netdevice notifier, which is put on the netns's notifier chain, and
      which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
      bridge, and the lower is an interface it knows about (one registered by
      this driver, normally). Then, from within that notifier, it does a bunch
      of stuff behind the bridge's back, without the bridge necessarily
      knowing that there's somebody offloading that port. It looks like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v
              call_netdevice_notifiers
                        |
                        v
             dsa_slave_netdevice_event
                        |
                        v
              oh, hey! it's for me!
                        |
                        v
                 .port_bridge_join
      
      What we do to solve the conundrum is to be less silent, and change the
      switchdev drivers to present themselves to the bridge. Something like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  hardware domain for
                        v                        |  this port, and zero
             dsa_slave_netdevice_event           |  if I got nothing.
                        |                        |
                        v                        |
              oh, hey! it's for me!              |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                        |                        |
                        +------------------------+
                   switchdev_bridge_port_offload(swp0, swp0)
      
      Then stacked interfaces (like bond0 on top of swp3/swp4) would be
      treated differently in DSA, depending on whether we can or cannot
      offload them.
      
      The offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  switchdev mark for
                        v                        |        bond0.
             dsa_slave_netdevice_event           | Coincidentally (or not),
                        |                        | bond0 and swp0, swp1, swp2
                        v                        | all have the same switchdev
              hmm, it's not quite for me,        | mark now, since the ASIC
               but my driver has already         | is able to forward towards
                 called .port_lag_join           | all these ports in hw.
                for it, because I have           |
            a port with dp->lag_dev == bond0.    |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                 for swp3 and swp4               |
                        |                        |
                        +------------------------+
                  switchdev_bridge_port_offload(bond0, swp3)
                  switchdev_bridge_port_offload(bond0, swp4)
      
      And the non-offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge waiting:
              call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                        |                        |  wasn't called, okay, I'll use a
                        v                        |  hwdom of zero for this one.
             dsa_slave_netdevice_event           :  Then packets received on swp0 will
                        |                        :  not be software-forwarded towards
                        v                        :  swp1, but they will towards bond0.
               it's not for me, but
             bond0 is an upper of swp3
            and swp4, but their dp->lag_dev
             is NULL because they couldn't
                  offload it.
      
      Basically we can draw the conclusion that the lowers of a bridge port
      can come and go, so depending on the configuration of lowers for a
      bridge port, it can dynamically toggle between offloaded and unoffloaded.
      Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
      
      This patch changes the way any switchdev driver interacts with the
      bridge. From now on, everybody needs to call switchdev_bridge_port_offload
      and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
      port as non-offloaded and allow software flooding to other ports from
      the same ASIC.
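
The forwarding rule implied by the diagrams above can be sketched as a small toy model. This is not kernel code: every `toy_` name is hypothetical, and the assumption is simply that the bridge keeps one integer hardware domain per port, nonzero only after the driver has announced the offload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, not the bridge's real types. */
struct toy_bridge_port {
	const char *name;
	int hwdom;	/* 0 means "not offloaded" */
};

/* Models the effect of switchdev_bridge_port_offload(): the bridge
 * records a nonzero hardware domain derived from the port's ppid. */
void toy_port_offload(struct toy_bridge_port *p, int ppid_domain)
{
	assert(ppid_domain != 0);
	p->hwdom = ppid_domain;
}

/* Models the effect of switchdev_bridge_port_unoffload(). */
void toy_port_unoffload(struct toy_bridge_port *p)
{
	p->hwdom = 0;
}

/* Software forwarding is skipped only when both ports sit in the same
 * nonzero hardware domain, because the ASIC already forwarded the packet.
 * Hairpin mode is ignored in this sketch. */
bool toy_should_sw_forward(const struct toy_bridge_port *in,
			   const struct toy_bridge_port *out)
{
	if (in == out)
		return false;
	return in->hwdom == 0 || in->hwdom != out->hwdom;
}
```

With swp0 and swp1 offloaded into the same domain and bond0 left at zero, this model forwards swp0 traffic in software towards bond0 but not towards swp1, which matches the non-offload case drawn above.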
      
      Note that these functions lay the ground for a more complex handshake
      between switchdev drivers and the bridge in the future.
      
      For drivers that will request a replay of the switchdev objects when
      they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
      place the call to switchdev_bridge_port_unoffload() strategically inside
      the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
      NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
      need the netdev adjacency lists to be valid, and that is only true in
      NETDEV_PRECHANGEUPPER.
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f5dc00f
  15. 17 Jul, 2021 1 commit
  16. 29 Jun, 2021 1 commit
    • V
      net: switchdev: add a context void pointer to struct switchdev_notifier_info · 69bfac96
      Vladimir Oltean committed
      In the case where the driver asks for a replay of a certain type of
      event (port object or attribute) for a bridge port that is a LAG, it may
      do so because this port has just joined the LAG.
      
      But there might already be other switchdev ports in that LAG, and it is
      preferable that those preexisting switchdev ports do not act upon the
      replayed event.
      
      The solution is to add a context to switchdev events, which is NULL most
      of the time (when the bridge layer initiates the call) but which can be
      set to a value controlled by the switchdev driver when a replay is
      requested. The driver can then check the context to figure out if all
      ports within the LAG should act upon the switchdev event, or just the
      ones that match the context.
      
      We have to modify all switchdev_handle_* helper functions as well as the
      prototypes in the drivers that use these helpers too, because these
      helpers hide the underlying struct switchdev_notifier_info from us and
      there is no way to retrieve the context otherwise.
      
      The context structure will be populated and used in later patches.
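
The context check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `toy_` names are hypothetical, and using the address of the newly joined port as the context value is an assumption made here, since the commit only says the value is controlled by the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver's per-port private data. */
struct toy_port {
	const char *name;
	const void *lag;	/* the LAG this port is a member of */
};

/* ctx == NULL: the bridge layer initiated the event, so every port
 *              in the LAG acts on it.
 * ctx != NULL: the driver requested a replay, so only the port that
 *              matches the context acts on it. */
bool toy_port_should_handle(const struct toy_port *port, const void *ctx)
{
	return ctx == NULL || ctx == (const void *)port;
}
```

If swp3 is already in the LAG and swp4 has just joined, a replay issued with the context set to swp4 is handled by swp4 only, while an ordinary bridge event with a NULL context is handled by both.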
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 18 May, 2021 1 commit
  18. 17 Apr, 2021 1 commit
  19. 18 Mar, 2021 2 commits
    • A
      mlxsw: Allow 802.1d and .1ad VxLAN bridges to coexist on Spectrum>=2 · bf677bd2
      Amit Cohen committed
      Currently only one EtherType can be configured for pushing in tunnels
      because EtherType is configured using SPVID.et_vlan for tunnel port.
      
      Configuring a second EtherType is currently prevented by comparing the
      mlxsw_sp_nve_config struct for each new tunnel. The struct contains an
      'ethertype' field, which means that only one EtherType is legal at any
      given time. Remove the 'ethertype' field to allow creating VxLAN devices
      with different bridges.
      
      To allow using several types of VxLAN bridges at the same time, the
      EtherType should be determined at the egress port. This behavior is
      achieved by setting SPVID to decide which EtherType to push at egress and,
      for each local_port that is a member of an 802.1ad bridge, setting
      SPEVET.et_vlan to ether_type1 (i.e., 0x88A8).
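
The per-egress-port choice can be pictured with a small sketch. It does not touch any register. The `toy_` names are hypothetical, and the 0x8100 default for ports outside an 802.1ad bridge is an assumption based on the standard 802.1Q TPID. Only the 0x88A8 value comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_ETH_P_8021Q  0x8100	/* assumed default: standard 802.1Q TPID */
#define TOY_ETH_P_8021AD 0x88A8	/* ether_type1 in the text above */

/* Hypothetical per-port state; not the driver's real structure. */
struct toy_local_port {
	unsigned int id;
	bool in_8021ad_bridge;
};

/* Pick the EtherType to push at egress from the port's bridge membership. */
uint16_t toy_egress_ethertype(const struct toy_local_port *port)
{
	return port->in_8021ad_bridge ? TOY_ETH_P_8021AD : TOY_ETH_P_8021Q;
}
```

Because the decision is made per egress port rather than once for the tunnel port, ports in 802.1d and 802.1ad bridges can coexist in this model.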
      
      Use switchdev_ops->init() to set different mlxsw_sp_bridge_ops for
      different ASICs, so that the behavior when a port joins / leaves an
      802.1ad bridge can differ between ASICs.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      mlxsw: Add struct mlxsw_sp_switchdev_ops per ASIC · 0f74fa56
      Amit Cohen committed
      A subsequent patch will need to implement a different set of operations
      when a port joins / leaves an 802.1ad bridge, based on the ASIC type.
      
      Prepare for this change by allowing to initialize the bridge module
      based on the ASIC type via 'struct mlxsw_sp_switchdev_ops'.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 13 Feb, 2021 2 commits