1. 12 1月, 2021 1 次提交
    • V
      net: switchdev: remove vid_begin -> vid_end range from VLAN objects · b7a9e0da
      Vladimir Oltean 提交于
      The call path of a switchdev VLAN addition to the bridge looks something
      like this today:
      
              nbp_vlan_init
              |  __br_vlan_set_default_pvid
              |  |                       |
              |  |    br_afspec          |
              |  |        |              |
              |  |        v              |
              |  | br_process_vlan_info  |
              |  |        |              |
              |  |        v              |
              |  |   br_vlan_info        |
              |  |       / \            /
              |  |      /   \          /
              |  |     /     \        /
              |  |    /       \      /
              v  v   v         v    v
            nbp_vlan_add   br_vlan_add ------+
             |              ^      ^ |       |
             |             /       | |       |
             |            /       /  /       |
             \ br_vlan_get_master/  /        v
              \        ^        /  /  br_vlan_add_existing
               \       |       /  /          |
                \      |      /  /          /
                 \     |     /  /          /
                  \    |    /  /          /
                   \   |   /  /          /
                    v  |   | v          /
                    __vlan_add         /
                       / |            /
                      /  |           /
                     v   |          /
         __vlan_vid_add  |         /
                     \   |        /
                      v  v        v
            br_switchdev_port_vlan_add
      
      The ranges UAPI was introduced to the bridge in commit bdced7ef
      ("bridge: support for multiple vlans and vlan ranges in setlink and
      dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
      have always been passed one by one, through struct bridge_vlan_info
      tmp_vinfo, to br_vlan_info. So the range never went too far in depth.
      
      Then Scott Feldman introduced the switchdev_port_bridge_setlink function
      in commit 47f8328b ("switchdev: add new switchdev bridge setlink").
      That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
      full use of the range. But switchdev_port_bridge_setlink was called like
      this:
      
      br_setlink
      -> br_afspec
      -> switchdev_port_bridge_setlink
      
      Basically, the switchdev and the bridge code were not tightly integrated.
      Then commit 41c498b9 ("bridge: restore br_setlink back to original")
      came, and switchdev drivers were required to implement
      .ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.
      
      In the meantime, commits such as 0944d6b5 ("bridge: try switchdev op
      first in __vlan_vid_add/del") finally made switchdev penetrate the
      br_vlan_info() barrier and start to develop the call path we have today.
      But remember, br_vlan_info() still receives VLANs one by one.
      
      Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
      29ab586c ("net: switchdev: Remove bridge bypass support from
      switchdev") so that drivers would not implement .ndo_bridge_setlink any
      longer. The switchdev_port_bridge_setlink also got deleted.
      This refactoring removed the parallel bridge_setlink implementation from
      switchdev, and left the only switchdev VLAN objects to be the ones
      offloaded from __vlan_vid_add (basically RX filtering) and  __vlan_add
      (the latter coming from commit 9c86ce2c ("net: bridge: Notify about
      bridge VLANs")).
      
      That is to say, today the switchdev VLAN object ranges are not used in
      the kernel. Refactoring the above call path is a bit complicated, when
      the bridge VLAN call path is already a bit complicated.
      
      Let's go off and finish the job of commit 29ab586c by deleting the
      bogus iteration through the VLAN ranges from the drivers. Some aspects
      of this feature never made too much sense in the first place. For
      example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
      flag supposed to mean, when a port can obviously have a single pvid?
      This particular configuration _is_ denied as of commit 6623c60d
      ("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
      perspective, the driver still has to play pretend, and only offload the
      vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
      modify the flags of another, completely unrelated, switchdev VLAN
      object! (a VLAN that is PVID will invalidate the PVID flag from whatever
      other VLAN had previously been offloaded with switchdev and had that
      flag. Yet switchdev never notifies about that change, drivers are
      supposed to guess).
      
      Nonetheless, having a VLAN range in the API makes error handling look
      scarier than it really is - unwinding on errors and all of that.
      When in reality, no one really calls this API with more than one VLAN.
      It is all unnecessary complexity.
      
      And despite appearing pretentious (two-phase transactional model and
      all), the switchdev API is really sloppy because the VLAN addition and
      removal operations are not paired with one another (you can add a VLAN
      100 times and delete it just once). The bridge notifies through
      switchdev of a VLAN addition not only when the flags of an existing VLAN
      change, but also when nothing changes. There are switchdev drivers out
      there who don't like adding a VLAN that has already been added, and
      those checks don't really belong at driver level. But the fact that the
      API contains ranges is yet another factor that prevents this from being
      addressed in the future.
      
      Of the existing switchdev pieces of hardware, it appears that only
      Mellanox Spectrum supports offloading more than one VLAN at a time,
      through mlxsw_sp_port_vlan_set. I have kept that code internal to the
      driver, because there is some more bookkeeping that makes use of it, but
      I deleted it from the switchdev API. But since the switchdev support for
      ranges has already been de facto deleted by a Mellanox employee and
      nobody noticed for 4 years, I'm going to assume it's not a biggie.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      b7a9e0da
  2. 10 1月, 2021 8 次提交
  3. 09 1月, 2021 1 次提交
    • J
      net: ip_tunnel: clean up endianness conversions · fda4fde2
      Julian Wiedmann 提交于
      sparse complains about some harmless endianness issues:
      
      > net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16
      > net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types)
      > net/ipv4/ip_tunnel_core.c:225:43:    expected restricted __be16 [usertype] mtu
      > net/ipv4/ip_tunnel_core.c:225:43:    got unsigned short [usertype]
      
      iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion
      when storing the MTU into the ICMPv4 packet. Use htons(), just like
      iptunnel_pmtud_build_icmpv6() does.
      
      > net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16
      > net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types)
      > net/ipv4/ip_tunnel_core.c:248:35:    expected unsigned short type
      > net/ipv4/ip_tunnel_core.c:248:35:    got restricted __be16 [usertype]
      > net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16
      > net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types)
      > net/ipv4/ip_tunnel_core.c:341:35:    expected unsigned short type
      > net/ipv4/ip_tunnel_core.c:341:35:    got restricted __be16 [usertype]
      
      eth_header() wants the Ethertype in host-order, use the correct flavour of
      byte-order conversion.
      
      > net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer
      > net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types)
      > net/ipv4/ip_tunnel_core.c:609:30:    expected int type
      > net/ipv4/ip_tunnel_core.c:609:30:    got restricted __be16 [usertype]
      > net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types)
      > net/ipv4/ip_tunnel_core.c:619:30:    expected int type
      > net/ipv4/ip_tunnel_core.c:619:30:    got restricted __be16 [usertype]
      > net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types)
      > net/ipv4/ip_tunnel_core.c:629:30:    expected int type
      > net/ipv4/ip_tunnel_core.c:629:30:    got restricted __be16 [usertype]
      
      The TUNNEL_* types are big-endian, so adjust the type of the local
      variable in ip_tun_parse_opts().
      Signed-off-by: NJulian Wiedmann <jwi@linux.ibm.com>
      Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      fda4fde2
  4. 08 1月, 2021 28 次提交
    • P
      nexthop: Bounce NHA_GATEWAY in FDB nexthop groups · b19218b2
      Petr Machata 提交于
      The function nh_check_attr_group() is called to validate nexthop groups.
      The intention of that code seems to have been to bounce all attributes
      above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
      these attributes except when NHA_FDB attribute is present--then it accepts
      them.
      
      NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
      bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
      back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
      
      But that still leaves NHA_GATEWAY as an attribute that would be accepted in
      FDB nexthop groups (with no meaning), so long as it keeps the address
      family as unspecified:
      
       # ip nexthop add id 1 fdb via 127.0.0.1
       # ip nexthop add id 10 fdb via default group 1
      
      The nexthop code is still relatively new and likely not used very broadly,
      and the FDB bits are newer still. Even though there is a reproducer out
      there, it relies on an improbable gateway arguments "via default", "via
      all" or "via any". Given all this, I believe it is OK to reformulate the
      condition to do the right thing and bounce NHA_GATEWAY.
      
      Fixes: 38428d68 ("nexthop: support for fdb ecmp nexthops")
      Signed-off-by: NPetr Machata <petrm@nvidia.com>
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      b19218b2
    • I
      nexthop: Unlink nexthop group entry in error path · 7b01e53e
      Ido Schimmel 提交于
      In case of error, remove the nexthop group entry from the list to which
      it was previously added.
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NPetr Machata <petrm@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      7b01e53e
    • I
      nexthop: Fix off-by-one error in error path · 07e61a97
      Ido Schimmel 提交于
      A reference was not taken for the current nexthop entry, so do not try
      to put it in the error path.
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NPetr Machata <petrm@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      07e61a97
    • J
      skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put} · 8e044917
      Jonathan Lemon 提交于
      Unlike the rest of the skb_zcopy_ functions, these routines
      operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
      prefix from the naming to make things clearer.
      Suggested-by: NWillem de Bruijn <willemdebruijn.kernel@gmail.com>
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      8e044917
    • J
      skbuff: add flags to ubuf_info for ubuf setup · 04c2d33e
      Jonathan Lemon 提交于
      Currently, when an ubuf is attached to a new skb, the shared
      flags word is initialized to a fixed value.  Instead of doing
      this, set the default flags in the ubuf, and have new skbs
      inherit from this default.
      
      This is needed when setting up different zerocopy types.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      04c2d33e
    • J
      net: group skb_shinfo zerocopy related bits together. · 06b4feb3
      Jonathan Lemon 提交于
      In preparation for expanded zerocopy (TX and RX), move
      the zerocopy related bits out of tx_flags into their own
      flag word.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      06b4feb3
    • J
      skbuff: rename sock_zerocopy_* to msg_zerocopy_* · 8c793822
      Jonathan Lemon 提交于
      At Willem's suggestion, rename the sock_zerocopy_* functions
      so that they match the MSG_ZEROCOPY flag, which makes it clear
      they are specific to this zerocopy implementation.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      8c793822
    • J
      skbuff: Call skb_zcopy_clear() before unref'ing fragments · 70c43167
      Jonathan Lemon 提交于
      RX zerocopy fragment pages which are not allocated from the
      system page pool require special handling.  Give the callback
      in skb_zcopy_clear() a chance to process them first.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      70c43167
    • J
      skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort · 236a6b1c
      Jonathan Lemon 提交于
      The sock_zerocopy_put_abort function contains logic which is
      specific to the current zerocopy implementation.  Add a wrapper
      which checks the callback and dispatches apppropriately.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      236a6b1c
    • J
      skbuff: Add skb parameter to the ubuf zerocopy callback · 36177832
      Jonathan Lemon 提交于
      Add an optional skb parameter to the zerocopy callback parameter,
      which is passed down from skb_zcopy_clear().  This gives access
      to the original skb, which is needed for upcoming RX zero-copy
      error handling.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      36177832
    • J
      skbuff: replace sock_zerocopy_get with skb_zcopy_get · e76d46cf
      Jonathan Lemon 提交于
      Rename the get routines for consistency.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      e76d46cf
    • J
      skbuff: replace sock_zerocopy_put() with skb_zcopy_put() · 59776362
      Jonathan Lemon 提交于
      Replace sock_zerocopy_put with the generic skb_zcopy_put()
      function.  Pass 'true' as the success argument, as this
      is identical to no change.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      59776362
    • J
      skbuff: Push status and refcounts into sock_zerocopy_callback · 75518851
      Jonathan Lemon 提交于
      Before this change, the caller of sock_zerocopy_callback would
      need to save the zerocopy status, decrement and check the refcount,
      and then call the callback function - the callback was only invoked
      when the refcount reached zero.
      
      Now, the caller just passes the status into the callback function,
      which saves the status and handles its own refcounts.
      
      This makes the behavior of the sock_zerocopy_callback identical
      to the tpacket and vhost callbacks.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      75518851
    • J
      skbuff: simplify sock_zerocopy_put · d6adf1b1
      Jonathan Lemon 提交于
      All 'struct ubuf_info' users should have a callback defined
      as of commit 0a4a060b ("sock: fix zerocopy_success regression
      with msg_zerocopy").
      
      Remove the dead code path to consume_skb(), which makes
      assumptions about how the structure was allocated.
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      d6adf1b1
    • V
      net: dsa: remove the DSA specific notifiers · 1dbb1302
      Vladimir Oltean 提交于
      This effectively reverts commit 60724d4b ("net: dsa: Add support for
      DSA specific notifiers"). The reason is that since commit 2f1e8ea7
      ("net: dsa: link interfaces with the DSA master to get rid of lockdep
      warnings"), it appears that there is a generic way to achieve the same
      purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
      converted to use the generic notifiers.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      1dbb1302
    • V
      net: dsa: export dsa_slave_dev_check · a5e3c9ba
      Vladimir Oltean 提交于
      Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
      they are enslaved to e.g. a bridge by calling netif_is_bridge_master().
      
      Export this helper from DSA to get the equivalent functionality of
      determining whether the upper interface of a CHANGEUPPER notifier is a
      DSA switch interface or not.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      a5e3c9ba
    • V
      net: dsa: move the Broadcom tag information in a separate header file · f46b9b8e
      Vladimir Oltean 提交于
      It is a bit strange to see something as specific as Broadcom SYSTEMPORT
      bits in the main DSA include file. Move these away into a separate
      header, and have the tagger and the SYSTEMPORT driver include them.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      f46b9b8e
    • V
      net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors · d5f19486
      Vladimir Oltean 提交于
      Some DSA switches (and not only) cannot learn source MAC addresses from
      packets injected from the CPU. They only perform hardware address
      learning from inbound traffic.
      
      This can be problematic when we have a bridge spanning some DSA switch
      ports and some non-DSA ports (which we'll call "foreign interfaces" from
      DSA's perspective).
      
      There are 2 classes of problems created by the lack of learning on
      CPU-injected traffic:
      - excessive flooding, due to the fact that DSA treats those addresses as
        unknown
      - the risk of stale routes, which can lead to temporary packet loss
      
      To illustrate the second class, consider the following situation, which
      is common in production equipment (wireless access points, where there
      is a WLAN interface and an Ethernet switch, and these form a single
      bridging domain).
      
       AP 1:
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
             |                                                       ^        ^
             |                                                       |        |
             |                                                       |        |
             |                                                    Client A  Client B
             |
             |
             |
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       AP 2
      
      - br0 of AP 1 will know that Clients A and B are reachable via wlan0
      - the hardware fdb of a DSA switch driver today is not kept in sync with
        the software entries on other bridge ports, so it will not know that
        clients A and B are reachable via the CPU port UNLESS the hardware
        switch itself performs SA learning from traffic injected from the CPU.
        Nonetheless, a substantial number of switches don't.
      - the hardware fdb of the DSA switch on AP 2 may autonomously learn that
        Client A and B are reachable through swp0. Therefore, the software br0
        of AP 2 also may or may not learn this. In the example we're
        illustrating, some Ethernet traffic has been going on, and br0 from AP
        2 has indeed learnt that it can reach Client B through swp0.
      
      One of the wireless clients, say Client B, disconnects from AP 1 and
      roams to AP 2. The topology now looks like this:
      
       AP 1:
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
             |                                                            ^
             |                                                            |
             |                                                         Client A
             |
             |
             |                                                         Client B
             |                                                            |
             |                                                            v
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       AP 2
      
      - br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
      - br0 of AP 1 will (possibly) know that Client B has left wlan0. There
        are cases where it might never find out though. Either way, DSA today
        does not process that notification in any way.
      - the hardware FDB of the DSA switch on AP 1 may learn autonomously that
        Client B can be reached via swp0, if it receives any packet with
        Client 1's source MAC address over Ethernet.
      - the hardware FDB of the DSA switch on AP 2 still thinks that Client B
        can be reached via swp0. It does not know that it has roamed to wlan0,
        because it doesn't perform SA learning from the CPU port.
      
      Now Client A contacts Client B.
      AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
      segment.
      AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
      Hairpinning is disabled => drop.
      
      This problem comes from the fact that these switches have a 'blind spot'
      for addresses coming from software bridging. The generic solution is not
      to assume that hardware learning can be enabled somehow, but to listen
      to more bridge learning events. It turns out that the bridge driver does
      learn in software from all inbound frames, in __br_handle_local_finish.
      A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
      addresses serviced by the bridge on 'foreign' interfaces. The software
      bridge also does the right thing on migration, by notifying that the old
      entry is deleted, so that does not need to be special-cased in DSA. When
      it is deleted, we just need to delete our static FDB entry towards the
      CPU too, and wait.
      
      The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
      events received on its own interfaces, such as static FDB entries.
      
      Luckily we can change that, and DSA can listen to all switchdev FDB
      add/del events in the system and figure out if those events were emitted
      by a bridge that spans at least one of DSA's own ports. In case that is
      true, DSA will also offload that address towards its own CPU port, in
      the eventuality that there might be bridge clients attached to the DSA
      switch who want to talk to the station connected to the foreign
      interface.
      
      In terms of implementation, we need to keep the fdb_info->added_by_user
      check for the case where the switchdev event was targeted directly at a
      DSA switch port. But we don't need to look at that flag for snooped
      events. So the check is currently too late, we need to move it earlier.
      This also simplifies the code a bit, since we avoid uselessly allocating
      and freeing switchdev_work.
      
      We could probably do some improvements in the future. For example,
      multi-bridge support is rudimentary at the moment. If there are two
      bridges spanning a DSA switch's ports, and both of them need to service
      the same MAC address, then what will happen is that the migration of one
      of those stations will trigger the deletion of the FDB entry from the
      CPU port while it is still used by other bridge. That could be improved
      with reference counting but is left for another time.
      
      This behavior needs to be enabled at driver level by setting
      ds->assisted_learning_on_cpu_port = true. This is because we don't want
      to inflict a potential performance penalty (accesses through
      MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
      because address learning on the CPU port works there.
      Reported-by: NDENG Qingfang <dqfext@gmail.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      d5f19486
    • V
      net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB · 5fb4a451
      Vladimir Oltean 提交于
      Right now, the following would happen for a switch driver that does not
      implement .port_fdb_add or .port_fdb_del.
      
      dsa_slave_switchdev_event returns NOTIFY_OK and schedules:
      -> dsa_slave_switchdev_event_work
         -> dsa_port_fdb_add
            -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
               -> dsa_switch_fdb_add
                  -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP;
         -> an error is printed with dev_dbg, and
            dsa_fdb_offload_notify(switchdev_work) is not called.
      
      We can avoid scheduling the worker for nothing and say NOTIFY_DONE.
      Because we don't call dsa_fdb_offload_notify, the static FDB entry will
      remain just in the software bridge.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      5fb4a451
    • V
      net: dsa: move switchdev event implementation under the same switch/case statement · 447d290a
      Vladimir Oltean 提交于
      We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
      events even for interfaces where dsa_slave_dev_check returns false, so
      we need that check inside the switch-case statement for SWITCHDEV_FDB_*.
      
      This movement also avoids a useless allocation / free of switchdev_work
      on the untreated "default event" case.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      447d290a
    • V
      net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_work · c4bb76a9
      Vladimir Oltean 提交于
      Currently DSA doesn't add FDB entries on the CPU port, because it only
      does so through switchdev, which is associated with a net_device, and
      there are none of those for the CPU port.
      
      But actually FDB addresses on the CPU port have some use cases of their
      own, if the switchdev operations are initiated from within the DSA
      layer. There is just one problem with the existing code: it passes a
      structure in dsa_switchdev_event_work which was retrieved directly from
      switchdev, so it contains a net_device. We need to generalize the
      contents to something that covers the CPU port as well: the "ds, port"
      tuple is fine for that.
      
      Note that the new procedure for notifying the successful FDB offload is
      inspired from the rocker model.
      
      Also, nothing was being done if added_by_user was false. Let's check for
      that a lot earlier, and don't actually bother to schedule the worker
      for nothing.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      c4bb76a9
    • V
      net: dsa: be louder when a non-legacy FDB operation fails · 2fd18650
      Vladimir Oltean 提交于
      The dev_close() call was added in commit c9eb3e0f ("net: dsa: Add
      support for learning FDB through notification") "to indicate inconsistent
      situation" when we could not delete an FDB entry from the port.
      
      bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master
      
      It is a bit drastic and at the same time not helpful if the above fails
      to only print with netdev_dbg log level, but on the other hand to bring
      the interface down.
      
      So increase the verbosity of the error message, and drop dev_close().
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      2fd18650
    • V
      net: bridge: notify switchdev of disappearance of old FDB entry upon migration · 90dc8fd3
      Vladimir Oltean 提交于
      Currently the bridge emits atomic switchdev notifications for
      dynamically learnt FDB entries. Monitoring these notifications works
      wonders for switchdev drivers that want to keep their hardware FDB in
      sync with the bridge's FDB.
      
      For example station A wants to talk to station B in the diagram below,
      and we are concerned with the behavior of the bridge on the DUT device:
      
                         DUT
       +-------------------------------------+
       |                 br0                 |
       | +------+ +------+ +------+ +------+ |
       | |      | |      | |      | |      | |
       | | swp0 | | swp1 | | swp2 | | eth0 | |
       +-------------------------------------+
            |        |                  |
        Station A    |                  |
                     |                  |
               +--+------+--+    +--+------+--+
               |  |      |  |    |  |      |  |
               |  | swp0 |  |    |  | swp0 |  |
       Another |  +------+  |    |  +------+  | Another
        switch |     br0    |    |     br0    | switch
               |  +------+  |    |  +------+  |
               |  |      |  |    |  |      |  |
               |  | swp1 |  |    |  | swp1 |  |
               +--+------+--+    +--+------+--+
                                        |
                                    Station B
      
      Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has
      the following property: frames injected from its control interface bypass
      the internal address analyzer logic, and therefore, this hardware does
      not learn from the source address of packets transmitted by the network
      stack through it. So, since bridging between eth0 (where Station B is
      attached) and swp0 (where Station A is attached) is done in software,
      the switchdev hardware will never learn the source address of Station B.
      So the traffic towards that destination will be treated as unknown, i.e.
      flooded.
      
      This is where the bridge notifications come in handy. When br0 on the
      DUT sees frames with Station B's MAC address on eth0, the switchdev
      driver gets these notifications and can install a rule to send frames
      towards Station B's address that are incoming from swp0, swp1, swp2,
      only towards the control interface. This is all switchdev driver private
      business, which the notification makes possible.
      
      All is fine until someone unplugs Station B's cable and moves it to the
      other switch:
      
                         DUT
       +-------------------------------------+
       |                 br0                 |
       | +------+ +------+ +------+ +------+ |
       | |      | |      | |      | |      | |
       | | swp0 | | swp1 | | swp2 | | eth0 | |
       +-------------------------------------+
            |        |                  |
        Station A    |                  |
                     |                  |
               +--+------+--+    +--+------+--+
               |  |      |  |    |  |      |  |
               |  | swp0 |  |    |  | swp0 |  |
       Another |  +------+  |    |  +------+  | Another
        switch |     br0    |    |     br0    | switch
               |  +------+  |    |  +------+  |
               |  |      |  |    |  |      |  |
               |  | swp1 |  |    |  | swp1 |  |
               +--+------+--+    +--+------+--+
                     |
                 Station B
      
      Luckily for the use cases we care about, Station B is noisy enough that
      the DUT hears it (on swp1 this time). swp1 receives the frames and
      delivers them to the bridge, who enters the unlikely path in br_fdb_update
      of updating an existing entry. It moves the entry in the software bridge
      to swp1 and emits an addition notification towards that.
      
      As far as the switchdev driver is concerned, all that it needs to ensure
      is that traffic between Station A and Station B is not forever broken.
      If it does nothing, then the stale rule to send frames for Station B
      towards the control interface remains in place. But Station B is no
      longer reachable via the control interface, but via a port that can
      offload the bridge port learning attribute. It's just that the port is
      prevented from learning this address, since the rule overrides FDB
      updates. So the rule needs to go. The question is via what mechanism.
      
      It sure would be possible for this switchdev driver to keep track of all
      addresses which are sent to the control interface, and then also listen
      for bridge notifier events on its own ports, searching for the ones that
      have a MAC address which was previously sent to the control interface.
      But this is cumbersome and inefficient. Instead, with one small change,
      the bridge could notify of the address deletion from the old port, in a
      symmetrical manner with how it did for the insertion. Then the switchdev
      driver would not be required to monitor learn/forget events for its own
      ports. It could just delete the rule towards the control interface upon
      bridge entry migration. This would make hardware address learning be
      possible again. Then it would take a few more packets until the hardware
      and software FDB would be in sync again.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com>
      Reviewed-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      90dc8fd3
    • F
      net: ip: always refragment ip defragmented packets · bb4cc1a1
      Florian Westphal 提交于
      Conntrack reassembly records the largest fragment size seen in IPCB.
      However, when this gets forwarded/transmitted, fragmentation will only
      be forced if one of the fragmented packets had the DF bit set.
      
      In that case, a flag in IPCB will force fragmentation even if the
      MTU is large enough.
      
      This should work fine, but this breaks with ip tunnels.
      Consider client that sends a UDP datagram of size X to another host.
      
      The client fragments the datagram, so two packets, of size y and z, are
      sent. DF bit is not set on any of these packets.
      
      Middlebox netfilter reassembles those packets back to single size-X
      packet, before routing decision.
      
      packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
      isn't set.  At output time, ip refragmentation is skipped as well
      because x is still smaller than the mtu of the output device.
      
      If ttransmit device is an ip tunnel, the packet size increases to
      x+overhead.
      
      Also, tunnel might be configured to force DF bit on outer header.
      
      In this case, packet will be dropped (exceeds MTU) and an ICMP error is
      generated back to sender.
      
      But sender already respects the announced MTU, all the packets that
      it sent did fit the announced mtu.
      
      Force refragmentation as per original sizes unconditionally so ip tunnel
      will encapsulate the fragments instead.
      
      The only other solution I see is to place ip refragmentation in
      the ip_tunnel code to handle this case.
      
      Fixes: d6b915e2 ("ip_fragment: don't forward defragmented DF packet")
      Reported-by: NChristian Perle <christian.perle@secunet.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      bb4cc1a1
    • F
      net: fix pmtu check in nopmtudisc mode · 50c66167
      Florian Westphal 提交于
      For some reason ip_tunnel insist on setting the DF bit anyway when the
      inner header has the DF bit set, EVEN if the tunnel was configured with
      'nopmtudisc'.
      
      This means that the script added in the previous commit
      cannot be made to work by adding the 'nopmtudisc' flag to the
      ip tunnel configuration. Doing so breaks connectivity even for the
      without-conntrack/netfilter scenario.
      
      When nopmtudisc is set, the tunnel will skip the mtu check, so no
      icmp error is sent to client. Then, because inner header has DF set,
      the outer header gets added with DF bit set as well.
      
      IP stack then sends an error to itself because the packet exceeds
      the device MTU.
      
      Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      50c66167
    • J
      udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checks · b9ef3fec
      Jakub Kicinski 提交于
      Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
      udp_tunnel_nic_*_port() helpers, since they're always
      done right before the call.
      
      Add similar checks before calling the notifier.
      udp_tunnel_nic invokes the notifier without checking
      features which could result in some wasted cycles.
      Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      b9ef3fec
    • J
      udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpers · 876c4384
      Jakub Kicinski 提交于
      All drivers use udp_tunnel_nic_*_port() helpers, prepare for
      NDO removal by invoking those helpers directly.
      
      The helpers are safe to call on all devices, they check if
      device has the UDP tunnel state initialized.
      Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      876c4384
    • S
      net: ipv6: fib: flush exceptions when purging route · d8f5c296
      Sean Tranchetti 提交于
      Route removal is handled by two code paths. The main removal path is via
      fib6_del_route() which will handle purging any PMTU exceptions from the
      cache, removing all per-cpu copies of the DST entry used by the route, and
      releasing the fib6_info struct.
      
      The second removal location is during fib6_add_rt2node() during a route
      replacement operation. This path also calls fib6_purge_rt() to handle
      cleaning up the per-cpu copies of the DST entries and releasing the
      fib6_info associated with the older route, but it does not flush any PMTU
      exceptions that the older route had. Since the older route is removed from
      the tree during the replacement, we lose any way of accessing it again.
      
      As these lingering DSTs and the fib6_info struct are holding references to
      the underlying netdevice struct as well, unregistering that device from the
      kernel can never complete.
      
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      d8f5c296
  5. 07 1月, 2021 1 次提交
  6. 06 1月, 2021 1 次提交