1. 23 Jul, 2021 5 commits
    • net: socket: rework compat_ifreq_ioctl() · 29c49648
      Arnd Bergmann committed
      compat_ifreq_ioctl() is one of the last users of copy_in_user() and
      compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
      arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
      couple of socket family specific interpretations.
      
      The current implementation works correctly when calling dev_ioctl(),
      inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
      and packet_ioctl(). The ioctl handlers for x25, netrom and rose do
      not interpret the arguments and only block the corresponding commands,
      so they do not care.
      
      For af_inet6 and af_decnet however, the compat conversion is slightly
      incorrect, as it will copy more data than the native handler accesses:
      both of them use a structure that is shorter than ifreq.
      
      Replace the copy_in_user() conversion with a pair of accessor functions
      to read and write the ifreq data in place with the correct length where
      needed, while leaving the other ones to copy the (already compatible)
      structures directly.
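
The accessor idea can be sketched in plain user-space C. This is only a model: the `model_*` names are hypothetical, the 40-byte and 32-byte sizes are assumed layouts for a 64-bit and a 32-bit ifreq, and `memcpy()` stands in for the kernel's user-copy helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the native and 32-bit ifreq layouts. */
#define MODEL_IFNAMSIZ 16
#define MODEL_NATIVE_IFREQ_SIZE 40 /* assumed: 16-byte name + 24-byte union */
#define MODEL_COMPAT_IFREQ_SIZE 32 /* assumed: 16-byte name + 16-byte union */

struct model_ifreq {
	char name[MODEL_IFNAMSIZ];
	unsigned char payload[MODEL_NATIVE_IFREQ_SIZE - MODEL_IFNAMSIZ];
};

/* Number of bytes that may be touched in the caller's buffer. */
size_t model_ifreq_len(bool compat)
{
	return compat ? MODEL_COMPAT_IFREQ_SIZE : MODEL_NATIVE_IFREQ_SIZE;
}

/* Read accessor: zero the native struct, then copy only the valid prefix. */
void model_get_ifreq(struct model_ifreq *dst, const void *src, bool compat)
{
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, model_ifreq_len(compat));
}

/* Write accessor: copy back no more than the caller's structure holds. */
void model_put_ifreq(void *dst, const struct model_ifreq *src, bool compat)
{
	memcpy(dst, src, model_ifreq_len(compat));
}
```

The point of the design is that the length decision lives in one place, so a handler never reads or writes past the end of the shorter 32-bit structure.
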
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socket: simplify dev_ifconf handling · 876f0bf9
      Arnd Bergmann committed
      The dev_ifconf() calling conventions make compat handling
      more complicated than necessary. Simplify this by moving
      the in_compat_syscall() check into the function.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socket: remove register_gifconf · b0e99d03
      Arnd Bergmann committed
      Since dynamic registration of the gifconf() helper is only used for
      IPv4, and this cannot be in a loadable module, this can be simplified
      noticeably by turning it into a direct function call as a preparation
      for cleaning up the compat handling.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socket: rework SIOC?IFMAP ioctls · 709566d7
      Arnd Bergmann committed
      SIOCGIFMAP and SIOCSIFMAP currently require compat_alloc_user_space()
      and copy_in_user() for compat mode.
      
      Move the compat handling into the location where the structures are
      actually used, to avoid using those interfaces and get a clearer
      implementation.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: improve compat ioctl handling · dd98d289
      Arnd Bergmann committed
      The ethtool compat ioctl handling is hidden away in net/socket.c,
      which introduces a couple of minor oddities:
      
      - The implementation may end up diverging, as seen in the RXNFC
        extension in commit 84a1d9c4 ("net: ethtool: extend RXNFC
        API to support RSS spreading of filter matches") that does not work
        in compat mode.
      
      - Most architectures do not need the compat handling at all
        because u64 and compat_u64 have the same alignment.
      
      - On x86, the conversion is done for both x32 and i386 user space,
        but it's actually wrong to do it for x32 and cannot work there.
      
      - On 32-bit Arm, it never worked for compat oabi user space, since
        that needs to do the same conversion but does not.
      
      - It would be nice to get rid of both compat_alloc_user_space()
        and copy_in_user() throughout the kernel.
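
The alignment point in the list above can be illustrated with a small layout check. This is a sketch, not the ethtool code: the `model_*` names are hypothetical, and the typedef relies on the GCC/Clang `aligned` attribute being allowed to lower alignment when used in a typedef.

```c
#include <stddef.h>
#include <stdint.h>

/* Models a compat_u64 as seen on i386: 64 bits wide, 4-byte aligned. */
typedef uint64_t __attribute__((aligned(4))) model_compat_u64;

/* The same two fields, once with a natural u64 and once with the model type. */
struct model_native_cmd {
	uint32_t cmd;
	uint64_t data;
};

struct model_compat_cmd {
	uint32_t cmd;
	model_compat_u64 data;
};

/* Offset of the 64-bit member in each layout. */
size_t model_native_data_offset(void)
{
	return offsetof(struct model_native_cmd, data);
}

size_t model_compat_data_offset(void)
{
	return offsetof(struct model_compat_cmd, data);
}

/* Total size of the 4-byte-aligned layout. */
size_t model_compat_cmd_size(void)
{
	return sizeof(struct model_compat_cmd);
}
```

On a typical 64-bit build the native offset is 8 while the model compat offset is 4, which is the only situation where a conversion is needed; where both alignments agree, the two layouts are identical.
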
      
      None of these actually seems to be a serious problem that real
      users are likely to encounter, but fixing all of them actually
      leads to code that is both shorter and more readable.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Jul, 2021 18 commits
    • ipv6: fix "'ioam6_if_id_max' defined but not used" warn · 176f716c
      Matthieu Baerts committed
      When compiling without CONFIG_SYSCTL, this warning appears:
      
        net/ipv6/addrconf.c:99:12: error: 'ioam6_if_id_max' defined but not used [-Werror=unused-variable]
           99 | static u32 ioam6_if_id_max = U16_MAX;
              |            ^~~~~~~~~~~~~~~
        cc1: all warnings being treated as errors
      
      Simply moving the declaration of this variable under ...
      
        #ifdef CONFIG_SYSCTL
      
      ... with other similar variables fixes the issue.
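
A reduced illustration of the pattern, with hypothetical names (`MODEL_CONFIG_SYSCTL`, `model_if_id_max`) rather than the actual addrconf.c code: a static variable stays warning-free when it is defined under the same guard as its only user.

```c
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_SYSCTL; remove the define to model a
 * build without sysctl support. */
#define MODEL_CONFIG_SYSCTL 1

#ifdef MODEL_CONFIG_SYSCTL
/* Only the guarded code reads this limit, so it lives under the same guard. */
static uint32_t model_if_id_max = UINT16_MAX;

/* The single user of the limit. */
int model_if_id_in_range(uint32_t id)
{
	return id <= model_if_id_max;
}
#endif
```

With the define removed, both the variable and its user disappear together, so nothing is left "defined but not used".
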
      
      Fixes: 9ee11f0f ("ipv6: ioam: Data plane support for Pre-allocated Trace")
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: selftests: add MTU test · 802a76af
      Oleksij Rempel committed
      Test if we actually can send/receive packets with MTU size. This kind of
      issue was detected on ASIX HW with bogus EEPROM.
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_api: Fix the the wrong parameter · 9d85a6f4
      Yajun Deng committed
      The 4th parameter in tc_chain_notify() should be flags rather than seq.
      Let's change it back correctly.
      
      Fixes: 32a4f5ec ("net: sched: introduce chain object to uapi")
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switchdev: fix FDB entries towards foreign ports not getting propagated to us · 2b0a5688
      Vladimir Oltean committed
      The newly introduced switchdev_handle_fdb_{add,del}_to_device helpers
      solved a problem but introduced another one. They have a severe design
      bug: they do not propagate FDB events on foreign interfaces to us, i.e.
      this use case:
      
               br0
              /   \
             /     \
            /       \
           /         \
         swp0       eno0
      (switchdev)  (foreign)
      
      when an address is learned on eno0, what is supposed to happen is that
      this event should also be propagated towards swp0. Somehow I managed to
      convince myself that this did work correctly, but obviously it does not.
      
      The trouble with foreign interfaces is that we must reach a switchdev
      net_device pointer through a foreign net_device that has no direct
      upper/lower relationship with it. So we need to do exploratory searching
      through the lower interfaces of the foreign net_device's bridge upper
      (to reach swp0 from eno0, we must check its upper, br0, for lower
      interfaces that pass the check_cb and foreign_dev_check_cb). This is
      something that the previous code did not do; it just assumed that "dev"
      will become a switchdev interface at some point, somehow, probably by
      magic.
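
The exploratory search can be modelled in a few lines of stand-alone C. This is a simplified sketch under stated assumptions: `model_dev`, `model_enslave()` and `model_fdb_target()` are hypothetical names, and a single `is_switchdev` flag stands in for the check_cb and foreign_dev_check_cb callbacks.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_LOWERS 4

struct model_dev {
	const char *name;
	bool is_switchdev;                          /* stands in for check_cb() */
	struct model_dev *upper;                    /* bridge the device is under */
	struct model_dev *lowers[MODEL_MAX_LOWERS]; /* only used by bridges */
	size_t n_lowers;
};

/* Attach a port to a bridge (model of the upper/lower adjacency). */
void model_enslave(struct model_dev *bridge, struct model_dev *port)
{
	if (bridge->n_lowers < MODEL_MAX_LOWERS) {
		bridge->lowers[bridge->n_lowers++] = port;
		port->upper = bridge;
	}
}

/* Find the device an FDB event seen on "dev" should be delivered to. */
struct model_dev *model_fdb_target(struct model_dev *dev)
{
	if (dev->is_switchdev)
		return dev;
	if (!dev->upper)
		return NULL;
	/* Foreign interface: walk the lowers of its bridge upper. */
	for (size_t i = 0; i < dev->upper->n_lowers; i++)
		if (dev->upper->lowers[i]->is_switchdev)
			return dev->upper->lowers[i];
	return NULL;
}
```

In the br0/swp0/eno0 picture above, an event on eno0 resolves to swp0 only because the search goes up to br0 and back down again.
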
      
      With this patch, assisted address learning on the CPU port works again
      in DSA:
      
      ip link add br0 type bridge
      ip link set swp0 master br0
      ip link set eno0 master br0
      ip link set br0 up
      
      [   46.708929] mscc_felix 0000:00:00.5 swp0: Adding FDB entry towards eno0, addr 00:04:9f:05:f4:ab vid 0 as host address
      
      Fixes: 8ca07176 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
      Reported-by: Eric Woudstra <ericwouds@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: move the switchdev object replay helpers to "push" mode · 4e51bf44
      Vladimir Oltean committed
      Starting with commit 4f2673b3 ("net: bridge: add helper to replay
      port and host-joined mdb entries"), DSA has introduced some bridge
      helpers that replay switchdev events (FDB/MDB/VLAN additions and
      deletions) that can be lost by the switchdev drivers in a variety of
      circumstances:
      
      - an IP multicast group was host-joined on the bridge itself before any
        switchdev port joined the bridge, leading to the host MDB entries
        missing in the hardware database.
      - during the bridge creation process, the MAC address of the bridge was
        added to the FDB as an entry pointing towards the bridge device
        itself, but with no switchdev ports being part of the bridge yet, this
        local FDB entry would remain unknown to the switchdev hardware
        database.
      - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
        before any switchdev port joined that LAG, leading to the hardware
        database missing those entries.
      - a switchdev port left a LAG that is a bridge port, while the LAG
        remained part of the bridge, and all FDB/MDB/VLAN entries remained
        installed in the hardware database of the switchdev port.
      
      Also, since commit 0d2cfbd4 ("net: bridge: ignore switchdev events
      for LAG ports which didn't request replay"), DSA introduced a method,
      based on a const void *ctx, to ensure that two switchdev ports under the
      same LAG that is a bridge port do not see the same MDB/VLAN entry being
      replayed twice by the bridge, once for every bridge port that joins the
      LAG.
      
      With so many ordering corner cases being possible, it seems unreasonable
      to expect a switchdev driver writer to get it right from the first try.
      Therefore, now that DSA has experimented with the bridge replay helpers
      for a little bit, we can move the code to the bridge driver where it is
      more readily available to all switchdev drivers.
      
      To convert the switchdev object replay helpers from "pull mode" (where
      the driver asks for them) to a "push mode" (where the bridge offers them
      automatically), the biggest problem is that the bridge needs to be aware
      when a switchdev port joins and leaves, even when the switchdev is only
      indirectly a bridge port (for example when the bridge port is a LAG
      upper of the switchdev).
      
      Luckily, we already have a hook for that, in the form of the newly
      introduced switchdev_bridge_port_offload() and
      switchdev_bridge_port_unoffload() calls. These offer a natural place for
      hooking the object addition and deletion replays.
      
      Extend the above 2 functions with:
      - pointers to the switchdev atomic notifier (for FDB replays) and the
        blocking notifier (for MDB and VLAN replays).
      - the "const void *ctx" argument required for drivers to be able to
        disambiguate between which port is targeted, when multiple ports are
        lowers of the same LAG that is a bridge port. Most of the drivers pass
        NULL to this argument, except the ones that support LAG offload and have
        the proper context check already in place in the switchdev blocking
        notifier handler.
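
The role of the context pointer can be sketched as a one-line filter. The names below (`model_lag_port`, `model_event_is_for_port()`) are hypothetical and the check is an assumed shape of what a driver's notifier handler would do, not a quote from any driver.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_lag_port {
	int id;
};

/* Returns true if the event should be processed by "port". */
bool model_event_is_for_port(const struct model_lag_port *port, const void *ctx)
{
	/* NULL ctx: a regular (non-replay) event, meant for every port. */
	if (!ctx)
		return true;
	/* Replay: only the port that requested it should act on the event. */
	return ctx == port;
}
```

With two ports under the same LAG, a replay triggered for the second port is ignored by the first, so an entry is not installed twice.
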
      
      Also unexport the replay helpers, since nobody except the bridge calls
      them directly now.
      
      Note that:
      (a) we abuse the terminology slightly, because FDB entries are not
          "switchdev objects", but we count them as objects nonetheless.
          With no direct way to prove it, I think they are not modeled as
          switchdev objects because those can only be installed by the bridge
          to the hardware (as opposed to FDB entries which can be propagated
          in the other direction too). This is merely an abuse of terms; FDB
          entries are replayed too, despite not being objects.
      (b) the bridge does not attempt to sync port attributes to newly joined
          ports, just the countable stuff (the objects). The reason for this
          is simple: no universal and symmetric way to sync and unsync them is
          known. For example, VLAN filtering: what to do on unsync, disable or
          leave it enabled? Similarly, STP state, ageing timer, etc. What
          a switchdev port does when it becomes standalone again is not really
          up to the bridge's competence, and the driver should deal with it.
          On the other hand, replaying deletions of switchdev objects can be
          seen as a matter of cleanup and therefore be treated by the bridge,
          hence this patch.
      
      We make the replay helpers opt-in for drivers, because they might not
      bring immediate benefits for them:
      
      - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
        so br_vlan_replay() should not do anything for the new drivers on
        which we call it. The existing drivers where there was even a slight
        possibility for there to exist a VLAN on a bridge port before they
        join it are already guarded against this: mlxsw and prestera deny
        joining LAG interfaces that are members of a bridge.
      
      - br_fdb_replay() should now notify of local FDB entries, but I patched
        all drivers except DSA to ignore these new entries in commit
        2c4eca3e ("net: bridge: switchdev: include local flag in FDB
        notifications"). Driver authors can lift this restriction as they
        wish, and when they do, they can also opt into the FDB replay
        functionality.
      
      - br_mdb_replay() should fix a real issue which is described in commit
        4f2673b3 ("net: bridge: add helper to replay port and host-joined
        mdb entries"). However most drivers do not offload the
        SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
        offload this switchdev object, and I don't completely understand the
        way in which they offload this switchdev object anyway. So I'll leave
        it up to these drivers' respective maintainers to opt into
        br_mdb_replay().
      
      So most of the drivers pass NULL notifier blocks for the replay helpers,
      except:
      - dpaa2-switch which was already acked/regression-tested with the
        helpers enabled (and there isn't much of a downside in having them)
      - ocelot which already had replay logic in "pull" mode
      - DSA which already had replay logic in "pull" mode
      
      An important observation is that the drivers which don't currently
      request bridge event replays don't even have the
      switchdev_bridge_port_{offload,unoffload} calls placed in proper places
      right now. This was done to avoid unnecessary rework for drivers which
      might never even add support for this. For driver writers who wish to
      add replay support, this can be used as a tentative placement guide:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: guard the switchdev replay helpers against a NULL notifier block · 7105b50b
      Vladimir Oltean committed
      There is a desire to make the object and FDB replay helpers optional
      when moving them inside the bridge driver. For example a certain driver
      might not offload host MDBs and there is no case where the replay
      helpers would be of immediate use to it.
      
      So it would be nice if we could allow drivers to pass NULL pointers for
      the atomic and blocking notifier blocks, and the replay helpers to do
      nothing in that case.
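
A minimal model of such a guard, using hypothetical types (`model_notifier`, `model_replay()`) instead of the kernel's notifier blocks:

```c
#include <stddef.h>

struct model_notifier {
	int (*call)(struct model_notifier *nb, const void *entry);
};

/* Replays n entries to nb; a NULL notifier means "replay not wanted". */
int model_replay(struct model_notifier *nb, const void *const *entries, size_t n)
{
	if (!nb)
		return 0;
	for (size_t i = 0; i < n; i++) {
		int err = nb->call(nb, entries[i]);

		if (err)
			return err;
	}
	return 0;
}

/* Example callback that counts how many entries it has seen. */
static int model_seen;

int model_count_cb(struct model_notifier *nb, const void *entry)
{
	(void)nb;
	(void)entry;
	model_seen++;
	return 0;
}

int model_seen_count(void)
{
	return model_seen;
}
```

Returning 0 for a NULL notifier lets callers treat "no replay requested" as success, so drivers that pass NULL need no special casing at the call site.
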
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: switchdev: let drivers inform which bridge ports are offloaded · 2f5dc00f
      Vladimir Oltean committed
      On reception of an skb, the bridge checks if it was marked as 'already
      forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
      is, it assigns the source hardware domain of that skb based on the
      hardware domain of the ingress port. Then during forwarding, it enforces
      that the egress port must have a different hardware domain than the
      ingress one (this is done in nbp_switchdev_allowed_egress).
      
      Non-switchdev drivers don't report any physical switch id (neither
      through devlink nor .ndo_get_port_parent_id), therefore the bridge
      assigns them a hardware domain of 0, and packets coming from them will
      always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
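
The two checks described so far can be written down as a small model. The struct and function names (`fwd_skb`, `fwd_port`, `model_frame_mark()`, `model_allowed_egress()`) are hypothetical simplifications of the bridge's per-skb and per-port state.

```c
#include <stdbool.h>

struct fwd_skb {
	bool offload_fwd_mark; /* set by the driver: hardware already forwarded */
	int src_hwdom;         /* hardware domain of the ingress port */
};

struct fwd_port {
	int hwdom;             /* 0 for ports that report no switch id */
};

/* Ingress: for marked frames, remember the hardware domain they came from. */
void model_frame_mark(const struct fwd_port *in, struct fwd_skb *skb)
{
	if (skb->offload_fwd_mark)
		skb->src_hwdom = in->hwdom;
}

/* Egress: skip ports the hardware has already forwarded the frame to. */
bool model_allowed_egress(const struct fwd_port *out, const struct fwd_skb *skb)
{
	return !skb->offload_fwd_mark || skb->src_hwdom != out->hwdom;
}
```

A marked frame from a port in hardware domain 1 is thus never software-forwarded to another port in domain 1, while unmarked frames are forwarded everywhere.
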
      
      Problems appear due to the fact that DSA would like to perform software
      fallback for bonding and team interfaces that the physical switch cannot
      offload.
      
             +-- br0 ---+
            / /   |      \
           / /    |       \
          /  |    |      bond0
         /   |    |     /    \
       swp0 swp1 swp2 swp3 swp4
      
      There, it is desirable that the presence of swp3 and swp4 under a
      non-offloaded LAG does not preclude us from doing hardware bridging
      between swp0, swp1 and swp2. The bandwidth of the CPU is often high
      enough that software bridging between {swp0,swp1,swp2} and bond0 is not
      impractical.
      
      But this creates an impossible paradox given the current way in which
      port hardware domains are assigned. When the driver receives a packet
      from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
      something.
      
      - If we set it to 0, then the bridge will forward it towards swp1, swp2
        and bond0. But the switch has already forwarded it towards swp1 and
        swp2 (not to bond0, remember, that isn't offloaded, so as far as the
        switch is concerned, ports swp3 and swp4 are not looking up the FDB,
        and the entire bond0 is a destination that is strictly behind the
        CPU). But we don't want duplicated traffic towards swp1 and swp2, so
        it's not ok to set skb->offload_fwd_mark = 0.
      
      - If we set it to 1, then the bridge will not forward the skb towards
        the ports with the same switchdev mark, i.e. not to swp1, swp2 and
        bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
        have forwarded the skb there.
      
      So the real issue is that bond0 will be assigned the same hardware
      domain as {swp0,swp1,swp2}, because the function that assigns hardware
      domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
      lower interfaces until it finds something that implements devlink (calls
      dev_get_port_parent_id with bool recurse = true). This is a problem
      because the fact that bond0 can be offloaded by swp3 and swp4 in our
      example is merely an assumption.
      
      A solution is to give the bridge explicit hints as to what hardware
      domain it should use for each port.
      
      Currently, the bridging offload is very 'silent': a driver registers a
      netdevice notifier, which is put on the netns's notifier chain, and
      which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
      bridge, and the lower is an interface it knows about (one registered by
      this driver, normally). Then, from within that notifier, it does a bunch
      of stuff behind the bridge's back, without the bridge necessarily
      knowing that there's somebody offloading that port. It looks like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v
              call_netdevice_notifiers
                        |
                        v
             dsa_slave_netdevice_event
                        |
                        v
              oh, hey! it's for me!
                        |
                        v
                 .port_bridge_join
      
      What we do to solve the conundrum is to be less silent, and change the
      switchdev drivers to present themselves to the bridge. Something like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  hardware domain for
                        v                        |  this port, and zero
             dsa_slave_netdevice_event           |  if I got nothing.
                        |                        |
                        v                        |
              oh, hey! it's for me!              |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                        |                        |
                        +------------------------+
                   switchdev_bridge_port_offload(swp0, swp0)
      
      Then stacked interfaces (like bond0 on top of swp3/swp4) would be
      treated differently in DSA, depending on whether we can or cannot
      offload them.
      
      The offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  switchdev mark for
                        v                        |        bond0.
             dsa_slave_netdevice_event           | Coincidentally (or not),
                        |                        | bond0 and swp0, swp1, swp2
                        v                        | all have the same switchdev
              hmm, it's not quite for me,        | mark now, since the ASIC
               but my driver has already         | is able to forward towards
                 called .port_lag_join           | all these ports in hw.
                for it, because I have           |
            a port with dp->lag_dev == bond0.    |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                 for swp3 and swp4               |
                        |                        |
                        +------------------------+
                  switchdev_bridge_port_offload(bond0, swp3)
                  switchdev_bridge_port_offload(bond0, swp4)
      
      And the non-offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge waiting:
              call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                        |                        |  wasn't called, okay, I'll use a
                        v                        |  hwdom of zero for this one.
             dsa_slave_netdevice_event           :  Then packets received on swp0 will
                        |                        :  not be software-forwarded towards
                        v                        :  swp1, but they will towards bond0.
               it's not for me, but
             bond0 is an upper of swp3
            and swp4, but their dp->lag_dev
             is NULL because they couldn't
                  offload it.
      
      Basically we can draw the conclusion that the lowers of a bridge port
      can come and go, so depending on the configuration of lowers for a
      bridge port, it can dynamically toggle between offloaded and unoffloaded.
      Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
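
The toggling between offloaded and unoffloaded can be modelled as follows. Everything here is an illustrative assumption: the `model_brport` struct, the per-port counter, and the shortcut of using the ppid directly as the hardware domain are not taken from the patch.

```c
struct model_brport {
	int ppid;          /* switch id reported by the driver, 0 if none */
	int hwdom;         /* 0 means "not offloaded, forward in software" */
	int offload_count; /* lowers currently offloading this bridge port */
};

/* Called by a driver when one of its ports starts offloading the bridge port. */
void model_port_offload(struct model_brport *p, int ppid)
{
	if (p->offload_count++ == 0) {
		p->ppid = ppid;
		p->hwdom = ppid; /* simplification: use the ppid as the hwdom */
	}
}

/* Called when a lower stops offloading; the last one resets the port. */
void model_port_unoffload(struct model_brport *p)
{
	if (p->offload_count > 0 && --p->offload_count == 0) {
		p->ppid = 0;
		p->hwdom = 0;
	}
}
```

For bond0 with swp3 and swp4 beneath it, the port stays in the hardware domain until both lowers have unoffloaded, and a bond0 that nobody offloads keeps a hwdom of zero.
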
      
      This patch changes the way any switchdev driver interacts with the
      bridge. From now on, everybody needs to call switchdev_bridge_port_offload
      and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
      port as non-offloaded and allow software flooding to other ports from
      the same ASIC.
      
      Note that these functions lay the ground for a more complex handshake
      between switchdev drivers and the bridge in the future.
      
      For drivers that will request a replay of the switchdev objects when
      they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
      place the call to switchdev_bridge_port_unoffload() strategically inside
      the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
      NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
      need the netdev adjacency lists to be valid, and that is only true in
      NETDEV_PRECHANGEUPPER.
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: switchdev: recycle unused hwdoms · 85826610
      Tobias Waldekranz committed
      Since hwdoms have only been used thus far for equality comparisons, the
      bridge has used the simplest possible assignment policy; using a
      counter to keep track of the last value handed out.
      
      With the upcoming transmit offloading, we need to perform set
      operations efficiently based on hwdoms, e.g. we want to answer
      questions like "has this skb been forwarded to any port within this
      hwdom?"
      
      Move to a bitmap-based allocation scheme that recycles hwdoms once all
      members leave the bridge. This means that we can use a single
      unsigned long to keep track of the hwdoms that have received an skb.
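
A bitmap allocator of this kind can be sketched in portable C. The names (`model_bridge`, `model_hwdom_alloc()`, and so on) are hypothetical, plain shifts replace the kernel's bitmap helpers, and reserving identifier 0 for "no hardware domain" is an assumption of the model.

```c
#include <limits.h>

#define MODEL_HWDOM_MAX ((int)(sizeof(unsigned long) * CHAR_BIT))

struct model_bridge {
	unsigned long busy_hwdoms; /* bit N set: hwdom N is in use */
};

/* Hand out the lowest free hwdom >= 1; 0 stays reserved for "no hwdom". */
int model_hwdom_alloc(struct model_bridge *br)
{
	for (int id = 1; id < MODEL_HWDOM_MAX; id++) {
		if (!(br->busy_hwdoms & (1UL << id))) {
			br->busy_hwdoms |= 1UL << id;
			return id;
		}
	}
	return -1; /* all identifiers in use */
}

/* Return a hwdom to the pool once its last port has left the bridge. */
void model_hwdom_free(struct model_bridge *br, int id)
{
	br->busy_hwdoms &= ~(1UL << id);
}

/* Record that a frame has been forwarded within hwdom "id". */
void model_mark_forwarded(unsigned long *fwd_hwdoms, int id)
{
	*fwd_hwdoms |= 1UL << id;
}

/* Query: has this frame already been forwarded within hwdom "id"? */
int model_was_forwarded(unsigned long fwd_hwdoms, int id)
{
	return (int)((fwd_hwdoms >> id) & 1UL);
}
```

Because identifiers are recycled, they never grow past the width of one unsigned long, which is what makes the per-skb set a single word.
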
      
      v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX)
              into a plain unsigned long.
      v2->v6: none
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: disambiguate offload_fwd_mark · f7cf972f
      Tobias Waldekranz committed
      Before this change, four related - but distinct - concepts were named
      offload_fwd_mark:
      
      - skb->offload_fwd_mark: Set by the switchdev driver if the underlying
        hardware has already forwarded this frame to the other ports in the
        same hardware domain.
      
      - nbp->offload_fwd_mark: An identifier used to group ports that share
        the same hardware forwarding domain.
      
      - br->offload_fwd_mark: Counter used to make sure that unique IDs are
        used in cases where a bridge contains ports from multiple hardware
        domains.
      
      - skb->cb->offload_fwd_mark: The hardware domain on which the frame
        ingressed and was forwarded.
      
      Introduce the term "hardware forwarding domain" ("hwdom") in the
      bridge to denote a set of ports with the following property:
      
          If an skb with skb->offload_fwd_mark set is received on a port
          belonging to hwdom N, that frame has already been forwarded to all
          other ports in hwdom N.
      
      By decoupling the name from "offload_fwd_mark", we can extend the
      term's definition in the future - e.g. to add constraints that
      describe expected egress behavior - without overloading the meaning of
      "offload_fwd_mark".
      
      - nbp->offload_fwd_mark thus becomes nbp->hwdom.
      
      - br->offload_fwd_mark becomes br->last_hwdom.
      
      - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
        change in naming here mandates a slight change in behavior of the
        nbp_switchdev_frame_mark() function. Previously, it only set this
        value in skb->cb for packets with skb->offload_fwd_mark true (ones
        which were forwarded in hardware). Whereas now we always track the
        incoming hwdom for all packets coming from a switchdev (even for the
        packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
        reports etc). As all uses of skb->cb->offload_fwd_mark were already
        gated behind checks of skb->offload_fwd_mark, this will not introduce
        any functional change, but it paves the way for future changes where
        the ingressing hwdom must be known for frames coming from a switchdev
        regardless of whether they were forwarded in hardware or not
        (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
        always tracks which one).
      
        A typical example where this is relevant: the switchdev has a fixed
        configuration to trap STP BPDUs, but STP is not running on the bridge
        and the group_fwd_mask allows them to be forwarded. Say we have this
        setup:
      
              br0
             / | \
            /  |  \
        swp0 swp1 swp2
      
        A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
        set skb->offload_fwd_mark. The bridge determines that the frame should
        be forwarded to swp{1,2}. It is imperative that forward offloading is
        _not_ allowed in this case, as the source hwdom is already "poisoned".
      
        Recording the source hwdom allows this case to be handled properly.
      
      v2->v3: added code comments
      v3->v6: none
      Signed-off-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7cf972f
    • L
      net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum · 37120f23
      Lino Sanfilippo 提交于
      If the checksum calculation is offloaded to the network device (e.g. due
      to NETIF_F_HW_CSUM inherited from the DSA master device), the calculated
      layer 4 checksum is incorrect. This is because the DSA tag, which is
      placed after the layer 4 data, is considered part of the data and is
      thus erroneously included in the checksum calculation.
      To avoid this, always calculate the layer 4 checksum in software.
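      The effect can be reproduced with a minimal Python sketch. It assumes
      the RFC 1071 one's-complement checksum and uses a made-up 12-byte
      payload and a made-up one-byte tail tag; none of these names or values
      come from the tag_ksz driver.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = b"layer 4 data"  # hypothetical layer 4 segment (12 bytes)
tail_tag = b"\x01"         # hypothetical one-byte tail tag

# What software computes: the checksum over the layer 4 data only.
correct = internet_checksum(payload)
# What a device computes if it treats the tail tag as part of the data.
with_tag = internet_checksum(payload + tail_tag)
```

      Summing over the payload plus the tag gives a different value than
      summing over the payload alone, which is the mismatch described above.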
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37120f23
    • L
      net: dsa: ensure linearized SKBs in case of tail taggers · 21cf377a
      Lino Sanfilippo committed
      The function skb_put() that is used by tail taggers to make room for the
      DSA tag must only be called for linearized SKBs. However, if the slave
      device inherited features like NETIF_F_HW_SG or NETIF_F_FRAGLIST, the
      SKB passed to the slave's transmit function may not be linearized.
      Avoid those SKBs by clearing the NETIF_F_HW_SG and NETIF_F_FRAGLIST flags
      for tail taggers.
      Furthermore, since the tagging protocol can be changed at runtime, move
      the code for setting up the slave's features into
      dsa_slave_setup_tagger().
      Suggested-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21cf377a
    • W
      tcp: disable TFO blackhole logic by default · 213ad73d
      Wei Wang committed
      Multiple complaints have been raised from the TFO users on the internet
      stating that the TFO blackhole logic is too aggressive and gets falsely
      triggered too often.
      (e.g. https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/)
      Considering that most middleboxes no longer drop TFO packets, we decide
      to disable the blackhole logic by setting
      /proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_sec to 0 by default.
      
      Fixes: cf1ef3f0 ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      213ad73d
    • N
      net: bridge: multicast: add context support for host-joined groups · 58d913a3
      Nikolay Aleksandrov committed
      Adding bridge multicast context support for host-joined groups is easy
      because we only need the proper timer value. We pass the already chosen
      context and use its timer value.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58d913a3
    • N
      net: bridge: multicast: add mdb context support · 6567cb43
      Nikolay Aleksandrov committed
      Choose the proper bridge multicast context when user space is adding
      mdb entries. Currently we require the vlan to be configured on at least
      one device (port or bridge) in order to add an mdb entry if vlan
      mcast snooping is enabled (vlan snooping implies vlan filtering).
      Note that we always allow deleting an entry, regardless of the vlan state.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6567cb43
    • X
      sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set · 02dc2ee7
      Xin Long committed
      Currently, in sctp_packet_config(), sctp_transport_pmtu_check() is
      called to update the transport pathmtu with the dst's mtu when the
      dst's mtu has been changed by a non-SCTP stack such as xfrm.
      
      However, this should only happen when SPP_PMTUD_ENABLE is set, no
      matter where the dst's mtu was changed. Fix this by checking the
      SPP_PMTUD_ENABLE flag before calling sctp_transport_pmtu_check().
      
      Thanks Jacek for reporting and looking into this issue.
      
      v1->v2:
        - add the missing "{" to fix the build error.
      
      Fixes: 69fec325 ('Revert "sctp: remove sctp_transport_pmtu_check"')
      Reported-by: Jacek Szafraniec <jacek.szafraniec@nokia.com>
      Tested-by: Jacek Szafraniec <jacek.szafraniec@nokia.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02dc2ee7
    • E
      tcp: tweak len/truesize ratio for coalesce candidates · 240bfd13
      Eric Dumazet committed
      tcp_grow_window() is using skb->len/skb->truesize to increase tp->rcv_ssthresh
      which has a direct impact on advertized window sizes.
      
      We added TCP coalescing in linux-3.4 & linux-3.5:
      
      Instead of storing skbs with one or two MSS in receive queue (or OFO queue),
      we try to append segments together to reduce memory overhead.
      
      High performance network drivers tend to cook skb with 3 parts :
      
      1) sk_buff structure (256 bytes)
      2) skb->head contains room to copy headers as needed, and skb_shared_info
      3) page fragment(s) containing the ~1514 bytes frame (or more depending on MTU)
      
      Once coalesced into a previous skb, 1) and 2) are freed.
      
      We can therefore tweak the way we compute len/truesize ratio knowing
      that skb->truesize is inflated by 1) and 2) soon to be freed.
      
      This is done only for in-order skb, or skb coalesced into OFO queue.
      
      The result is that low rate flows no longer pay the memory price of having
      low GRO aggregation factor. Same result for drivers not using GRO.
      
      This is critical to allow a big enough receiver window,
      typically tcp_rmem[2] / 2.
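      
      As a rough illustration of the tweak, the Python sketch below compares
      the ratio before and after discounting parts 1) and 2). Only the
      256-byte sk_buff size comes from the text above; the head size,
      fragment size, payload length and the fallback guard are assumptions
      made for this example, not values or logic taken from the patch.

```python
SK_BUFF_SIZE = 256    # from the text above: sk_buff structure
HEAD_SIZE = 512       # assumption: skb->head including skb_shared_info
FRAG_TRUESIZE = 2048  # assumption: memory charged for the page fragment
PAYLOAD_LEN = 1448    # assumption: payload bytes of one MSS-sized segment


def len_truesize_ratio(length, truesize, freed_on_coalesce=0):
    """Return len/truesize, discounting memory that coalescing will free."""
    adjusted = truesize - freed_on_coalesce
    if adjusted < length:
        # Guard chosen for this sketch: fall back to the unadjusted value.
        adjusted = truesize
    return length / adjusted


truesize = SK_BUFF_SIZE + HEAD_SIZE + FRAG_TRUESIZE
plain = len_truesize_ratio(PAYLOAD_LEN, truesize)
tweaked = len_truesize_ratio(PAYLOAD_LEN, truesize, SK_BUFF_SIZE + HEAD_SIZE)
```

      With these assumed sizes the ratio rises from 1448/2816 to 1448/2048,
      so a coalesce candidate looks cheaper in memory and the receive window
      can grow faster.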
      
      We have been using this at Google for about 5 years, it is due time
      to make it upstream.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      240bfd13
    • N
      net: bridge: multicast: fix igmp/mld port context null pointer dereferences · 54cb4319
      Nikolay Aleksandrov committed
      With the recent change to use bridge/port multicast context pointers
      instead of bridge/port, I missed converting two locations which pass
      the port pointer as-is; with the new model we need to verify that the
      port context is non-NULL first and retrieve the port from it. The first
      location is querier selection when a query is received, the second is
      when leaving a group. The port context will be NULL if the packets
      originated from the bridge device (i.e. from the host).
      The fix is simple: check that the port context exists and retrieve
      the port pointer from it.
      
      Fixes: adc47037 ("net: bridge: multicast: use multicast contexts instead of bridge or port")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54cb4319
    • E
      tcp: avoid indirect call in tcp_new_space() · 739b2adf
      Eric Dumazet committed
      For tcp sockets, sk->sk_write_space is most probably sk_stream_write_space().
      
      Other sk->sk_write_space() calls in TCP are slow path and do not deserve
      any change.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      739b2adf
  3. 21 Jul, 2021 8 commits
    • V
      udp: check encap socket in __udp_lib_err · 9bfce73c
      Vadim Fedorenko committed
      Commit d26796ae ("udp: check udp sock encap_type in __udp_lib_err")
      added checks for encapsulated sockets, but it broke cases where the
      encapsulation has no implementation of encap_err_lookup, i.e. ESP in
      UDP encapsulation. Fix it by calling encap_err_lookup only if the
      socket implements this method; otherwise treat it as a legal socket.
      
      Fixes: d26796ae ("udp: check udp sock encap_type in __udp_lib_err")
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bfce73c
    • X
      sctp: update active_key for asoc when old key is being replaced · 58acd100
      Xin Long committed
      syzbot reported a call trace:
      
        BUG: KASAN: use-after-free in sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
        Call Trace:
         sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
         sctp_set_owner_w net/sctp/socket.c:131 [inline]
         sctp_sendmsg_to_asoc+0x152e/0x2180 net/sctp/socket.c:1865
         sctp_sendmsg+0x103b/0x1d30 net/sctp/socket.c:2027
         inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821
         sock_sendmsg_nosec net/socket.c:703 [inline]
         sock_sendmsg+0xcf/0x120 net/socket.c:723
      
      This is a use-after-free caused by not updating asoc->shkey after it
      was replaced in the key list asoc->endpoint_shared_keys and the old
      key was freed.
      
      Fix it by also updating active_key for the asoc when the old key is
      being replaced with a new one. Note that this issue doesn't exist in
      sctp_auth_del_key_id(), as it's not allowed to delete the active_key
      from the asoc.
      
      Fixes: 1b1e0bc9 ("sctp: add refcnt support for sh_key")
      Reported-by: syzbot+b774577370208727d12b@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58acd100
    • V
      net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward · ac6627a2
      Vadim Fedorenko committed
      Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
      aligned in both address families
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac6627a2
    • V
      net: ipv6: introduce ip6_dst_mtu_maybe_forward · 427faee1
      Vadim Fedorenko committed
      Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
      reuse this code in ip6_mtu. These two functions were almost
      duplicates; this change simplifies the maintenance of the
      MTU calculation code.
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      427faee1
    • J
      ipv6: ioam: Support for IOAM injection with lwtunnels · 3edede08
      Justin Iurman committed
      Add support for the IOAM inline insertion (only for the host-to-host use case)
      which is per-route configured with lightweight tunnels. The target is iproute2
      and the patch is ready. It will be posted as soon as this patchset is merged.
      Here is an overview:
      
      $ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0
      
      This example configures an IOAM Pre-allocated Trace option attached to the
      fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated
      trace data block is 12 octets (size) and only the first IOAM data (bit 0:
      hop_limit + node id) is included in the trace (type) represented as a bitfield.
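      
      The mapping from a trace-type bit number to the hexadecimal value used
      on the command line can be sketched in Python as follows. The 24-bit
      field width is an assumption (it matches the six hex digits of the
      example), and the helper name is made up for this illustration.

```python
TRACE_TYPE_BITS = 24  # assumption: width of the trace type bitfield


def trace_type_mask(bit: int) -> int:
    """Return the mask for a trace-type bit, with bit 0 as the MSB."""
    if not 0 <= bit < TRACE_TYPE_BITS:
        raise ValueError(f"bit must be in [0, {TRACE_TYPE_BITS})")
    return 1 << (TRACE_TYPE_BITS - 1 - bit)


# Bit 0 (hop_limit + node id) is the value used in the example above.
first_data_only = trace_type_mask(0)
```

      Under this numbering, bit 0 corresponds to 0x800000, the "type" value
      passed to ip in the example.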
      
      The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not
      implemented is explained on the patchset cover.
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3edede08
    • J
      ipv6: ioam: IOAM Generic Netlink API · 8c6f6fa6
      Justin Iurman committed
      Add Generic Netlink commands to allow userspace to configure IOAM
      namespaces and schemas. The target is iproute2 and the patch is ready.
      It will be posted as soon as this patchset is merged. Here is an overview:
      
      $ ip ioam
      Usage:	ip ioam { COMMAND | help }
      	ip ioam namespace show
      	ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
      	ip ioam namespace del ID
      	ip ioam schema show
      	ip ioam schema add ID DATA
      	ip ioam schema del ID
      	ip ioam namespace set ID schema { ID | none }
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c6f6fa6
    • J
      ipv6: ioam: Data plane support for Pre-allocated Trace · 9ee11f0f
      Justin Iurman committed
      Implement support for processing the IOAM Pre-allocated Trace with IPv6,
      see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].
      
      A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
      or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
       - net.ipv6.conf.XXX.ioam6_enabled
      
      Two other sysctls are introduced to define IOAM IDs, represented by an integer.
      They are respectively per-namespace and per-interface:
       - net.ipv6.ioam6_id
       - net.ipv6.conf.XXX.ioam6_id
      
      The value of the first one represents the IOAM ID of the node itself (u32; max
      and default value = U32_MAX>>8, due to hop limit concatenation) while the other
      represents the IOAM ID of an interface (u16; max and default value = U16_MAX).
      
      Each "ioam6_id" sysctl has a "_wide" equivalent:
       - net.ipv6.ioam6_id_wide
       - net.ipv6.conf.XXX.ioam6_id_wide
      
      The value of the first one represents the wide IOAM ID of the node itself (u64;
      max and default value = U64_MAX>>8, due to hop limit concatenation) while the
      other represents the wide IOAM ID of an interface (u32; max and default value
      = U32_MAX).
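      
      The limits above follow from simple bit arithmetic, sketched here in
      Python. The packing helper is hypothetical and assumes the hop limit
      occupies the top 8 bits of the combined field, which is one reading of
      "hop limit concatenation"; it is not code from the patch.

```python
U16_MAX = (1 << 16) - 1
U32_MAX = (1 << 32) - 1
U64_MAX = (1 << 64) - 1

# Maximum (and default) values as stated in the text above.
MAX_ID = {
    "net.ipv6.ioam6_id": U32_MAX >> 8,
    "net.ipv6.conf.XXX.ioam6_id": U16_MAX,
    "net.ipv6.ioam6_id_wide": U64_MAX >> 8,
    "net.ipv6.conf.XXX.ioam6_id_wide": U32_MAX,
}


def pack_hop_limit_and_node_id(hop_limit, node_id, wide=False):
    """Hypothetical helper: place an 8-bit hop limit above the node ID."""
    id_bits = 56 if wide else 24
    if not 0 <= hop_limit <= 0xFF:
        raise ValueError("hop_limit must fit in 8 bits")
    if not 0 <= node_id < (1 << id_bits):
        raise ValueError("node_id does not fit next to the hop limit")
    return (hop_limit << id_bits) | node_id
```

      Losing 8 bits to the hop limit is why the node IDs top out at
      U32_MAX>>8 and U64_MAX>>8, while the interface IDs can use their full
      width.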
      
      The use of short and wide equivalents is not exclusive, a deployment could
      choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
      could be an identifier for a physical interface, whereas
      net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
      logical sub-interface. Documentation about new sysctls is provided at the end
      of this patchset.
      
      Two relativistic hash tables are used: one for IOAM namespaces, the other for
      IOAM schemas. A namespace can only have a single active schema and a schema
      can only be attached to a single namespace (1:1 relationship).
      
        [1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
        [2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
        [3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ee11f0f
    • V
      net: switchdev: recurse into __switchdev_handle_fdb_del_to_device · 71f4f89a
      Vladimir Oltean committed
      The difference between __switchdev_handle_fdb_del_to_device and
      switchdev_handle_fdb_del_to_device is that the former takes an extra
      orig_dev argument, while the latter starts with dev == orig_dev.
      
      We should recurse into the variant that does not lose the orig_dev along
      the way. This is relevant when deleting FDB entries pointing towards a
      bridge (dev changes to the lower interfaces, but orig_dev shouldn't).
      
      The addition helper already recurses properly, just the deletion one
      doesn't.
      
      Fixes: 8ca07176 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71f4f89a
  4. 20 Jul, 2021 9 commits
    • P
      ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions · 8fb4792f
      Paolo Abeni committed
      While running the self-tests on a KASAN enabled kernel, I observed a
      slab-out-of-bounds splat very similar to the one reported in
      commit 821bbf79 ("ipv6: Fix KASAN: slab-out-of-bounds Read in
       fib6_nh_flush_exceptions").
      
      We additionally need to take care of fib6_metrics initialization
      failure when the caller provides an nh.
      
      The fix is similar, explicitly free the route instead of calling
      fib6_info_release on a half-initialized object.
      
      Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fb4792f
    • Y
      net: ipv4: add capability check for net administration · 8292d7f6
      Yang Yang committed
      Root in the init user namespace can modify
      /proc/sys/net/ipv4/ip_forward without CAP_NET_ADMIN, which doesn't
      follow the principle of capabilities. For example, in netdev_store(),
      root can't modify a netdev attribute without CAP_NET_ADMIN.
      So let's keep the permission check logic consistent.
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8292d7f6
    • P
      net/sched: act_skbmod: Skip non-Ethernet packets · 727d6a8b
      Peilin Ye committed
      Currently tcf_skbmod_act() assumes that packets use Ethernet as their L2
      protocol, which is not always the case.  As an example, for CAN devices:
      
      	$ ip link add dev vcan0 type vcan
      	$ ip link set up vcan0
      	$ tc qdisc add dev vcan0 root handle 1: htb
      	$ tc filter add dev vcan0 parent 1: protocol ip prio 10 \
      		matchall action skbmod swap mac
      
      Doing the above silently corrupts all the packets.  Do not perform skbmod
      actions for non-Ethernet packets.
      
      Fixes: 86da71b5 ("net_sched: Introduce skbmod action")
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      727d6a8b
    • V
      net: dsa: use switchdev_handle_fdb_{add,del}_to_device · b94dc99c
      Vladimir Oltean committed
      Using the new fan-out helper for FDB entries installed on the software
      bridge, we can install host addresses with the proper refcount on the
      CPU port, such that this case:
      
      ip link set swp0 master br0
      ip link set swp1 master br0
      ip link set swp2 master br0
      ip link set swp3 master br0
      ip link set br0 address 00:01:02:03:04:05
      ip link set swp3 nomaster
      
      works properly and the br0 address remains installed as a host entry
      with refcount 3 instead of getting deleted.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b94dc99c
    • V
      net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE · 8ca07176
      Vladimir Oltean committed
      Currently DSA has an issue with FDB entries pointing towards the bridge
      in the presence of br_fdb_replay() being called at port join and leave
      time.
      
      In particular, each bridge port will ask for a replay for the FDB
      entries pointing towards the bridge when it joins, and for another
      replay when it leaves.
      
      This means that for example, a bridge with 4 switch ports will notify
      DSA 4 times of the bridge MAC address.
      
      But if the MAC address of the bridge changes during the normal runtime
      of the system, the bridge notifies switchdev [ once ] of the deletion of
      the old MAC address as a local FDB towards the bridge, and of the
      insertion [ again once ] of the new MAC address as a local FDB.
      
      This is a problem, because DSA keeps the old MAC address as a host FDB
      entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the
      old MAC address will not be deleted. Additionally, the new MAC address
      will only be installed with refcount 1, and when the first switch port
      leaves the bridge (leaving 3 others as still members), it will delete
      with it the new MAC address of the bridge from the local FDB entries
      kept by DSA (because the br_fdb_replay call on deletion will bring the
      entry's refcount from 1 to 0).
      
      So the problem, really, is that the number of br_fdb_replay() calls
      does not match the refcount with which a host FDB entry is offloaded
      to DSA during normal runtime.
      
      An elegant way to solve the problem would be to make the switchdev
      notification emitted by br_fdb_change_mac_address() result in a host FDB
      kept by DSA which has a refcount exactly equal to the number of ports
      under that bridge. Then, no matter how many DSA ports join or leave that
      bridge, the host FDB entry will always be deleted when there are exactly
      zero remaining DSA switch ports members of the bridge.
      
      To implement the proposed solution, we remember that the switchdev
      objects and port attributes have some helpers provided by switchdev,
      which can be optionally called by drivers:
      switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.
      These helpers:
      - fan out a switchdev object/attribute emitted for the bridge towards
        all the lower interfaces that pass the check_cb().
      - fan out a switchdev object/attribute emitted for a bridge port that is
        a LAG towards all the lower interfaces that pass the check_cb().
      
      In other words, this is the model we need for the FDB events too:
      something that will keep an FDB entry emitted towards a physical port as
      it is, but translate an FDB entry emitted towards the bridge into N FDB
      entries, one per physical port.
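      
      The refcount mismatch and the effect of fanning out can be modelled
      with a short Python simulation. This is a toy model under stated
      assumptions, not DSA code: the HostFdb class, the simulate() helper and
      the MAC strings are invented for illustration, and the scenario is the
      4-port bridge with a MAC address change described above.

```python
from collections import Counter

OLD_MAC = "00:00:00:00:00:01"  # hypothetical old bridge MAC
NEW_MAC = "00:01:02:03:04:05"  # hypothetical new bridge MAC


class HostFdb:
    """Toy model of refcounted host FDB entries."""

    def __init__(self):
        self.refs = Counter()

    def add(self, mac):
        self.refs[mac] += 1

    def delete(self, mac):
        if self.refs[mac] > 0:
            self.refs[mac] -= 1
            if self.refs[mac] == 0:
                del self.refs[mac]  # entry is removed at refcount zero


def simulate(fan_out: bool, ports: int = 4) -> dict:
    fdb = HostFdb()
    for _ in range(ports):        # each port replays the bridge MAC on join
        fdb.add(OLD_MAC)
    # MAC change: the bridge emits one deletion and one insertion; with
    # fan-out, each event is translated into one event per port.
    copies = ports if fan_out else 1
    for _ in range(copies):
        fdb.delete(OLD_MAC)
    for _ in range(copies):
        fdb.add(NEW_MAC)
    fdb.delete(NEW_MAC)           # one port leaves: replay deletes once
    return dict(fdb.refs)
```

      Without fan-out the model ends with the stale old address at refcount
      3 and the new address gone; with fan-out only the new address remains,
      at refcount 3 for the 3 ports still in the bridge.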
      
      Of course, there are many differences between fanning out a switchdev
      object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB
      entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards
      a LAG should be treated specially, because FDB entries are unicast, we
      can't just install the same address towards 3 destinations. It is
      imaginable that drivers might want to treat this case specifically, so
      create some methods for this case and do not recurse into the LAG lower
      ports, just the bridge ports.
      
      DSA also listens for FDB entries on "foreign" interfaces, aka interfaces
      bridged with us which are not part of our hardware domain: think an
      Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA
      installs host FDB entries. However, there we have the same problem
      (those host FDB entries are installed with a refcount of only 1) and an
      even bigger one which we did not have with FDB entries towards the
      bridge:
      
      br_fdb_replay() is currently not called for FDB entries on foreign
      interfaces, just for the physical port and for the bridge itself.
      
      So when DSA sniffs an address learned by the software bridge towards a
      foreign interface like an e1000 port, and then that e1000 leaves the
      bridge, DSA remains with the dangling host FDB address. That will be
      fixed separately by replaying all FDB entries and not just the ones
      towards the port and the bridge.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ca07176
    • V
      net: switchdev: introduce helper for checking dynamically learned FDB entries · c6451cda
      Vladimir Oltean committed
      It is a bit difficult to understand what DSA checks when it tries to
      avoid installing dynamically learned addresses on foreign interfaces as
      local host addresses, so create a generic switchdev helper that can be
      reused and is generally more readable.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6451cda
    • V
      net: dsa: tag_8021q: add proper cross-chip notifier support · c64b9c05
      Vladimir Oltean committed
      The big problem which mandates cross-chip notifiers for tag_8021q is
      this:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      
      When the user runs:
      
      ip link add br0 type bridge
      ip link set sw0p0 master br0
      ip link set sw2p0 master br0
      
      It doesn't work.
      
      This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
      "other_ds" are at most 1 hop away from each other, so it is sufficient
      to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
      versa and presto, the cross-chip link works. When there is another
      switch in the middle, such as in this case switch 1 with its DSA links
      sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
      
      Which is exactly why the problem is quadratic: when a port joins a
      bridge, for each port in the tree that's already in that same bridge we
      notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
      tree. It is a very complicated web of VLANs.
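      
      The quadratic growth can be seen in a small Python count. This is a
      simplified model that issues one tree-wide notification per port
      already in the bridge whenever a new port joins; the function name is
      made up and the real notifier code may count differently.

```python
def count_vlan_notifications(num_ports: int) -> int:
    """Count tree-wide VLAN notifications as ports join one bridge."""
    members = []
    notifications = 0
    for port in range(num_ports):
        # One notification per port that is already a bridge member.
        notifications += len(members)
        members.append(port)
    return notifications


growth = [count_vlan_notifications(n) for n in (2, 4, 8)]
```

      Doubling the number of bridged ports roughly quadruples the number of
      notifications, since the total is n * (n - 1) / 2 in this model.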
      
      It must be mentioned that currently we install tag_8021q VLANs on too
      many ports (DSA links - to be precise, on all of them). For example,
      when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
      RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
      isn't any port of switch 0 that is a member of br0 (at least yet).
      In theory we could notify only the switches which sit in between the
      port joining the bridge and the port reacting to that bridge_join event.
      But in practice that is impossible, because of the way 'link' properties
      are described in the device tree. The DSA bindings require DT writers to
      list out not only the real/physical DSA links, but in fact the entire
      routing table, like for example switch 0 above will have:
      
      	sw0p3: port@3 {
      		link = <&sw1p4 &sw2p4>;
      	};
      
      This was done because:
      
      /* TODO: ideally DSA ports would have a single dp->link_dp member,
       * and no dst->rtable nor this struct dsa_link would be needed,
       * but this would require some more complex tree walking,
       * so keep it stupid at the moment and list them all.
       */
      
      but it is a perfect example of a situation where too much information is
      actively detrimental, because we are now in the position where we
      cannot distinguish a real DSA link from one that is put there to avoid
      the 'complex tree walking'. And because DT is ABI, there is not much we
      can change.
      
      And because we do not know which DSA links are real and which ones
      aren't, we can't really know if DSA switch A is in the data path between
      switches B and C, in the general case.
      
      So this is why tag_8021q RX VLANs are added on all DSA links, and
      probably why it will never change.
      
      On the other hand, at least the number of additions/deletions is well
      balanced, and this means that once we implement reference counting at
      the cross-chip notifier level a la fdb/mdb, there is absolutely zero
      need for a struct dsa_8021q_crosschip_link, it's all self-managing.
      
      In fact, with the tag_8021q notifiers emitted from the bridge join
      notifiers, it becomes so generic that sja1105 does not need to do
      anything anymore, we can just delete its implementation of the
      .crosschip_bridge_{join,leave} methods.
      
      Among the other things we can simply delete is the home-grown
      implementation of sja1105_notify_crosschip_switches(). That
      implementation is wrong because it is not quadratic: it only covers
      remote switches to which we have a cross-chip bridging link, and that
      does not cover in-between switches. This deletion is part of the same
      patch because sja1105 used to poke deep inside the guts of the
      tag_8021q context in order to do that. Because the cross-chip links
      went away, so must the sja1105 code.
      
      Last but not least, dsa_8021q_setup_port() is simplified (and also
      renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
      on the CPU port too, the four dsa_8021q_vid_apply() calls:
      - 1 for RX VLAN on user port
      - 1 for the user port's RX VLAN on the CPU port
      - 1 for TX VLAN on user port
      - 1 for the user port's TX VLAN on the CPU port
      
      now get squashed into only 2 notifier calls via
      dsa_port_tag_8021q_vlan_add.
      
      And because the notifiers to add and to delete a tag_8021q VLAN are
      distinct, now we finally break up the port setup and teardown into
      separate functions instead of relying on a "bool enabled" flag which
      tells us what to do. Arguably it should have been this way from the
      get go.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c64b9c05
    • V
      net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time · e19cc13c
      Vladimir Oltean committed
      There has been at least one wasted opportunity for tag_8021q to be used
      by a driver:
      
      https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272
      
      because of a design decision: the declared purpose of tag_8021q is to
      offer source port/switch identification for a tagging driver for packets
      coming from a switch with no hardware DSA tagging support. It is not
      intended to provide VLAN-based port isolation, because its first user,
      sja1105, had another mechanism for bridging domain isolation, the L2
      Forwarding Table. So even if 2 ports are in the same VLAN but they are
      separated via the L2 Forwarding Table, they will not communicate with
      one another. The L2 Forwarding Table is managed by the
      sja1105_bridge_join() and sja1105_bridge_leave() methods.
      
      As a consequence, today tag_8021q does not bother too much with hooking
      into .port_bridge_join() and .port_bridge_leave() because that would
      introduce yet another degree of freedom, it just iterates statically
      through all ports of a switch and adds the RX VLAN of one port to all
      the others. In this way, whenever .port_bridge_join() is called,
      bridging will magically work because the RX VLANs are already installed
      everywhere they need to be.
      
      This is not to say that the reason for the change in this patch is to
      satisfy the hellcreek and similar use cases, that is merely a nice side
      effect. Instead it is to make sja1105 cross-chip links work properly
      over a DSA link.
      
      For context, sja1105 today supports a degenerate form of cross-chip
      bridging, where the switches are interconnected through their CPU ports
      ("disjoint trees" topology). There is some code which has been
      generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
      enough, and frankly it is impossible to build upon that.
      Real multi-switch DSA trees, like daisy chains or H trees, which have
      actual DSA links, do not work.
      
      The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
      for cross-chip bridging, which is a table by which the local switch can
      select the forwarding domain for packets from a certain ingress switch
      ID and source port. The sja1105 switches cannot parse their own DSA
      tags, because, well, they don't really have support for DSA tags, it's
      all VLANs.
      
      So to make something like cross-chip bridging between sw0p0 and sw1p0
      work over the sw0p3/sw1p3 DSA link with sja1105 in the topology
      below:
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ] ---- [  dsa  ] [  cpu  ] [  user ] [  user ]
      
      we need to ask ourselves 2 questions:
      
      (1) how should the L2 Forwarding Table be managed?
      (2) how should the VLAN Lookup Table be managed?
      
      i.e. what should prevent packets from going to unwanted ports?
      
      Since, as mentioned, there is no PVT, the L2 Forwarding Table only
      contains forwarding rules for local ports. So we can say "all user ports
      are allowed to forward to all CPU ports and all DSA links".
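
      The quoted rule can be sketched as a per-port forwarding mask. This is
      a minimal userspace model, assuming the sw0 layout from the diagram
      (p0/p1 user, p2 CPU, p3 DSA); the fw_* names are hypothetical and do
      not correspond to sja1105 driver code. Forwarding between bridged user
      ports and the masks of CPU/DSA source ports are not covered by the
      rule and are left out here.

```c
#include <assert.h>
#include <stdint.h>

enum fw_port_type { FW_PORT_USER, FW_PORT_CPU, FW_PORT_DSA };

#define FW_NUM_PORTS 4

/* Port roles of sw0 in the topology: p0/p1 user, p2 CPU, p3 DSA link. */
const enum fw_port_type fw_sw0_layout[FW_NUM_PORTS] = {
	FW_PORT_USER, FW_PORT_USER, FW_PORT_CPU, FW_PORT_DSA,
};

/*
 * Return a bitmask of local destination ports a user port may forward to
 * under the rule "all user ports may forward to all CPU ports and all DSA
 * links". Non-user source ports return 0 because the rule says nothing
 * about them (a choice made for this sketch only).
 */
uint32_t fw_user_port_mask(const enum fw_port_type *layout, int num_ports,
			   int port)
{
	uint32_t mask = 0;

	assert(num_ports > 0 && num_ports <= 32);
	assert(port >= 0 && port < num_ports);

	if (layout[port] != FW_PORT_USER)
		return 0;

	for (int i = 0; i < num_ports; i++)
		if (layout[i] == FW_PORT_CPU || layout[i] == FW_PORT_DSA)
			mask |= UINT32_C(1) << i;

	return mask;
}
```

      With this layout, both user ports get bits 2 and 3 set, so any
      isolation between bridges has to come from somewhere other than this
      table.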
      
      If we allow forwarding to DSA links unconditionally, this means we must
      prevent forwarding using the VLAN Lookup Table. This is in fact
      asymmetric with what we do for tag_8021q on ports local to the same
      switch, and it matters because now that we are making tag_8021q a core
      DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
      the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
      for local forwarding in the same way as cross-chip forwarding.
      
      Note that there is a very precise reason why tag_8021q hooks into
      dsa_switch_bridge_join() which acts at the cross-chip notifier level,
      and not at a higher level such as dsa_port_bridge_join(). We need to
      install the RX VLAN of the newly joining port into the VLAN table of all
      the existing ports across the tree that are part of the same bridge, and
      the notifier already does the iteration through the switches for us.
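
      The dynamic behavior described above might be modeled roughly as
      follows. This is a userspace sketch, not the kernel implementation:
      the dyn_* names, the flat port index, and the 2x4 port layout are
      hypothetical. It also installs the existing members' RX VLANs on the
      joining port, which is an assumption made for symmetry; CPU and DSA
      ports are not treated specially.

```c
#include <assert.h>
#include <stdbool.h>

#define DYN_NUM_SWITCHES 2
#define DYN_PORTS_PER_SWITCH 4
#define DYN_NUM_PORTS (DYN_NUM_SWITCHES * DYN_PORTS_PER_SWITCH)
#define DYN_NO_BRIDGE (-1)

struct dyn_port {
	int bridge;                         /* bridge id, or DYN_NO_BRIDGE */
	bool has_rx_vlan_of[DYN_NUM_PORTS]; /* RX VLANs installed on this port */
};

struct dyn_port dyn_ports[DYN_NUM_PORTS];

/* Flatten (switch, port) into one tree-wide index. */
int dyn_port_index(int sw, int port)
{
	assert(sw >= 0 && sw < DYN_NUM_SWITCHES);
	assert(port >= 0 && port < DYN_PORTS_PER_SWITCH);
	return sw * DYN_PORTS_PER_SWITCH + port;
}

/* Start with no bridges and no RX VLANs installed anywhere. */
void dyn_init(void)
{
	for (int i = 0; i < DYN_NUM_PORTS; i++) {
		dyn_ports[i].bridge = DYN_NO_BRIDGE;
		for (int j = 0; j < DYN_NUM_PORTS; j++)
			dyn_ports[i].has_rx_vlan_of[j] = false;
	}
}

/* On join, walk the whole tree and pair up with members of the same bridge. */
void dyn_bridge_join(int port, int bridge)
{
	assert(port >= 0 && port < DYN_NUM_PORTS);
	assert(bridge >= 0);
	assert(dyn_ports[port].bridge == DYN_NO_BRIDGE);

	for (int other = 0; other < DYN_NUM_PORTS; other++) {
		if (other == port || dyn_ports[other].bridge != bridge)
			continue;
		dyn_ports[other].has_rx_vlan_of[port] = true;
		dyn_ports[port].has_rx_vlan_of[other] = true;
	}
	dyn_ports[port].bridge = bridge;
}

/* On leave, remove the same entries again. */
void dyn_bridge_leave(int port)
{
	int bridge;

	assert(port >= 0 && port < DYN_NUM_PORTS);
	bridge = dyn_ports[port].bridge;
	if (bridge == DYN_NO_BRIDGE)
		return;

	dyn_ports[port].bridge = DYN_NO_BRIDGE;
	for (int other = 0; other < DYN_NUM_PORTS; other++) {
		if (dyn_ports[other].bridge != bridge)
			continue;
		dyn_ports[other].has_rx_vlan_of[port] = false;
		dyn_ports[port].has_rx_vlan_of[other] = false;
	}
}
```

      Because the loop covers every port in the tree, sw0p0 and sw1p0 end up
      sharing RX VLANs after both join the same bridge, while an unbridged
      port such as sw0p1 is left untouched.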
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e19cc13c
    • V
      net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register · 328621f6
      Vladimir Oltean committed
      Right now, setting up tag_8021q is a 2-step operation for a driver:
      first the context structure needs to be created, then the VLANs need to
      be installed on the ports. A similar thing is true for teardown.
      
      Merge the 2 steps into the register/unregister methods, to be as
      transparent as possible for the driver as to what tag_8021q does behind
      the scenes. This also gets rid of the funny "bool setup == true means
      setup, == false means teardown" API that tag_8021q used to expose.
      
      Note that dsa_tag_8021q_register() must be called no earlier than the
      .setup() driver method (and not, for example, in the driver probe
      function). This is because the DSA switch tree is not yet initialized
      at probe time, so the cross-chip notifiers will not work.
      
      For symmetry with .setup(), the unregister method should be put in
      .teardown().
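
      The ordering constraint can be illustrated with a small userspace
      mock. Everything here is a hypothetical stand-in: the mock_* types and
      functions are not the kernel's, and the "return -1 when called too
      early" behavior is invented for the sketch just to make the
      constraint visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a switch; not the kernel's struct dsa_switch. */
struct mock_switch {
	bool tree_ready;   /* set once the (mock) DSA tree is initialized */
	bool tag_8021q_on; /* set while the (mock) tag_8021q context exists */
};

/* Single-step registration: context creation and VLAN install together. */
int mock_tag_8021q_register(struct mock_switch *ds)
{
	assert(ds != NULL);
	if (!ds->tree_ready)
		return -1; /* mock behavior only: refuse when called too early */
	ds->tag_8021q_on = true;
	return 0;
}

/* Single-step teardown, mirroring the register call. */
void mock_tag_8021q_unregister(struct mock_switch *ds)
{
	assert(ds != NULL);
	ds->tag_8021q_on = false;
}

/* Probe: the tree does not exist yet, so no tag_8021q call is made here. */
int mock_driver_probe(struct mock_switch *ds)
{
	assert(ds != NULL);
	ds->tree_ready = false;
	ds->tag_8021q_on = false;
	return 0;
}

/* .setup() equivalent: the earliest point where registering is valid. */
int mock_driver_setup(struct mock_switch *ds)
{
	return mock_tag_8021q_register(ds);
}

/* .teardown() equivalent: undo what setup did. */
void mock_driver_teardown(struct mock_switch *ds)
{
	mock_tag_8021q_unregister(ds);
}
```

      In the mock, a test harness flips tree_ready between probe and setup
      to stand in for the DSA core initializing the switch tree.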
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      328621f6