1. 21 Jan 2021, 2 commits
  2. 20 Jan 2021, 1 commit
  3. 19 Jan 2021, 5 commits
    • net/bonding: Declare TLS RX device offload support · dc5809f9
      Authored by Tariq Toukan
      Following the description in the previous patch (for TX):
      because the TLS module bypasses the bond interface and interacts
      directly with the lower devs, there is no way for the bond
      interface to disable its device offload capabilities, as long as
      the mode/policy config allows it.
      Hence, the feature flag is not directly controllable; it just
      reflects the offload status based on the logic under
      bond_sk_check().
      
      Here we just declare RX device offload support and expose it via
      the NETIF_F_HW_TLS_RX flag.
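      
      As a rough sketch (not the actual bonding code), the declaration
      could look like this, with the per-socket decision deferred to
      bond_sk_check():
      
        #include <linux/netdevice.h>
        
        /* Hedged sketch, not the actual bonding code: advertise TLS RX
         * device offload on the bond. Whether it takes effect for a
         * given socket is decided elsewhere, via bond_sk_check(). */
        static void bond_declare_tls_rx(struct net_device *bond_dev)
        {
                bond_dev->hw_features |= NETIF_F_HW_TLS_RX;
                bond_dev->features    |= NETIF_F_HW_TLS_RX;
        }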
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/bonding: Implement TLS TX device offload · 89df6a81
      Authored by Tariq Toukan
      Implement TLS TX device offload for bonding interfaces.
      This allows kTLS sockets running on a bond to benefit from the
      device offload on capable lower devices.
      
      To allow a simple and fast maintenance of the TLS context in SW and
      lower devices, we bind the TLS socket to a specific lower dev.
      To achieve a behavior similar to SW kTLS, we support only balance-xor
      and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
      in bond_sk_check(), done in a previous patch.
      
      For the above configuration, the SW implementation keeps picking the
      same exact lower dev for all the socket's SKBs. The device offload
      behaves similarly, making the decision once at the connection creation.
      
      Per socket, the TLS module should work directly with the lowest
      netdev in the chain, to call the tls_dev_ops operations.
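      
      A minimal sketch of that resolution idea, assuming an
      ndo_sk_get_lower_dev hook at each stacking layer (the helper name
      here is illustrative):
      
        #include <linux/netdevice.h>
        #include <net/sock.h>
        
        /* Hedged sketch: descend through the lower devices for a given
         * socket until a device without the hook is reached, so TLS can
         * program that device directly. */
        static struct net_device *sk_lowest_dev(struct net_device *dev,
                                                struct sock *sk)
        {
                struct net_device *lower;
        
                while (dev->netdev_ops->ndo_sk_get_lower_dev &&
                       (lower = dev->netdev_ops->ndo_sk_get_lower_dev(dev, sk)))
                        dev = lower;
        
                return dev;
        }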
      
      As the TLS module bypasses the bond interface and interacts
      directly with the lower devs, there is no way for the bond
      interface to disable its device offload capabilities, as long as
      the mode/policy config allows it.
      Hence, the feature flag is not directly controllable; it just
      reflects the current offload status based on the logic under
      bond_sk_check().
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/bonding: Implement ndo_sk_get_lower_dev · 007feb87
      Authored by Tariq Toukan
      Add ndo_sk_get_lower_dev() implementation for bond interfaces.
      
      This is supported only for the cases where the socket's hash and
      the SKBs' hash yield an identical value for the whole connection
      lifetime.
      
      Here we restrict it to L3+4 sockets only, with
      xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
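      
      A hedged sketch of the selection rule; bond_sk_hash_l34,
      bond_slave_count and bond_slave_dev_by_index are assumed helper
      names, not the actual bonding internals:
      
        #include <net/sock.h>
        
        /* Hedged sketch: hash the socket's L3+L4 tuple the same way the
         * xmit path hashes SKBs, so the socket maps to the same lower
         * dev as every one of its packets. All helpers are assumed. */
        static struct net_device *bond_pick_lower_dev(struct bonding *bond,
                                                      struct sock *sk)
        {
                unsigned int count = bond_slave_count(bond);    /* assumed */
                u32 hash = bond_sk_hash_l34(sk);                /* assumed */
        
                return count ? bond_slave_dev_by_index(bond, hash % count) : NULL;
        }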
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net_sched: fix RTNL deadlock again caused by request_module() · d349f997
      Authored by Cong Wang
      tcf_action_init_1() loads tc action modules automatically with
      request_module() after parsing the tc action names, dropping the
      RTNL lock before request_module() and re-taking it afterwards.
      This causes a lot of trouble, as discovered by syzbot, because we
      can be in the middle of batch initialization when we create an
      array of tc actions.
      
      One of the problems is a deadlock:
      
      CPU 0					CPU 1
      rtnl_lock();
      for (...) {
        tcf_action_init_1();
          -> rtnl_unlock();
          -> request_module();
      				rtnl_lock();
      				for (...) {
      				  tcf_action_init_1();
      				    -> tcf_idr_check_alloc();
      				   // Insert one action into idr,
      				   // but it is not committed until
      				   // tcf_idr_insert_many(), then drop
      				   // the RTNL lock in the _next_
      				   // iteration
      				   -> rtnl_unlock();
          -> rtnl_lock();
          -> a_o->init();
            -> tcf_idr_check_alloc();
            // Now waiting for the same index
            // to be committed
      				    -> request_module();
      				    -> rtnl_lock()
      				    // Now waiting for RTNL lock
      				}
      				rtnl_unlock();
      }
      rtnl_unlock();
      
      This is not easy to solve; we can move the request_module()
      before this loop, pre-load all the modules we need for this
      netlink message, and then do the rest of the initializations. So
      the loop now breaks down into two:
      
              for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                      struct tc_action_ops *a_o;
      
                      a_o = tc_action_load_ops(name, tb[i]...);
                      ops[i - 1] = a_o;
              }
      
              for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                      act = tcf_action_init_1(ops[i - 1]...);
              }
      
      Although this looks serious, it has only been reported by syzbot,
      so it seems hard for humans to trigger. And given the size of this
      patch, I'd suggest taking it through net-next and not backporting
      it to stable.
      
      This patch has been tested by syzbot and tested with tdc.py by me.
      
      Fixes: 0fedc63f ("net_sched: commit action insertions together")
      Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
      Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: fix TCP_USER_TIMEOUT with zero window · 9d9b1ee0
      Authored by Enke Chen
      The TCP session does not terminate with TCP_USER_TIMEOUT when data
      remain untransmitted due to a zero window.
      
      The number of unanswered zero-window probes (tcp_probes_out) is
      reset to zero with incoming acks irrespective of the window size,
      as described in tcp_probe_timer():
      
          RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
          as long as the receiver continues to respond probes. We support
          this by default and reset icsk_probes_out with incoming ACKs.
      
      This counter, however, is the wrong one to use when calculating
      the duration that the window remains closed and data remain
      untransmitted. Thanks to Jonathan Maxwell <jmaxwell37@gmail.com>
      for diagnosing the actual issue.
      
      This patch introduces a new per-socket timestamp in order to track
      the elapsed time for zero-window probes that have not been
      answered with any non-zero-window ack.
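      
      A hedged sketch of the resulting check; icsk_probes_tstamp is an
      assumed field recorded when the first unanswered zero-window probe
      is sent, and the helper name is illustrative:
      
        #include <net/tcp.h>
        
        /* Hedged sketch: measure how long the window has stayed closed
         * from a dedicated probe timestamp instead of the probe counter
         * that incoming ACKs keep resetting. */
        static bool zero_window_timed_out(const struct sock *sk,
                                          u32 user_timeout_ms)
        {
                const struct inet_connection_sock *icsk = inet_csk(sk);
                u32 elapsed = tcp_jiffies32 - icsk->icsk_probes_tstamp;
        
                return user_timeout_ms &&
                       elapsed >= msecs_to_jiffies(user_timeout_ms);
        }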
      
      Fixes: 9721e709 ("tcp: simplify window probe aborting on USER_TIMEOUT")
      Reported-by: William McCall <william.mccall@gmail.com>
      Co-developed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 16 Jan 2021, 2 commits
  5. 15 Jan 2021, 2 commits
  6. 13 Jan 2021, 1 commit
  7. 12 Jan 2021, 6 commits
    • net: switchdev: delete the transaction object · 8f73cc50
      Authored by Vladimir Oltean
      Now that all users of struct switchdev_trans have been modified to do
      without it, we can remove this structure and the two helpers to determine
      the phase.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: remove the transactional logic from VLAN objects · 1958d581
      Authored by Vladimir Oltean
      It should be the driver's business to logically separate its VLAN
      offloading into a preparation and a commit phase, and some drivers
      don't need, or can't provide, such a separation.
      
      So remove the transactional shim from DSA and let drivers propagate
      errors directly from the .port_vlan_add callback.
      
      It would appear that the code has worse error handling now than it had
      before. DSA is the only in-kernel user of switchdev that offloads one
      switchdev object to more than one port: for every VLAN object offloaded
      to a user port, that VLAN is also offloaded to the CPU port. So the
      "prepare for user port -> check for errors -> prepare for CPU port ->
      check for errors -> commit for user port -> commit for CPU port"
      sequence appears to make more sense than the one we are using now:
      "offload to user port -> check for errors -> offload to CPU port ->
      check for errors", but it is really a compromise. In the new way, we can
      catch errors from the commit phase that we previously had to ignore.
      But we have our hands tied and cannot do any rollback now: if we add a
      VLAN on the CPU port and it fails, we can't do the rollback by simply
      deleting it from the user port, because the switchdev API is not so nice
      with us: it could have simply been there already, even with the same
      flags. So we don't even attempt to rollback anything on addition error,
      just leave whatever VLANs managed to get offloaded right where they are.
      This should not be a problem at all in practice.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: remove the transactional logic from MDB entries · a52b2da7
      Authored by Vladimir Oltean
      For many drivers, the .port_mdb_prepare callback offered no real
      opportunity to catch error conditions in advance, and they would
      suppress errors found during the actual commit phase.
      
      Where a logical separation between the prepare and the commit phase
      existed, the function that used to implement the .port_mdb_prepare
      callback still exists, but now it is called directly from .port_mdb_add,
      which was modified to return an int code.
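      
      A hedged sketch of the new callback shape, using an invented "foo"
      driver (all foo_* helpers are assumed):
      
        #include <net/dsa.h>
        
        /* Hedged sketch: the former prepare logic runs first and its
         * error is propagated; the hardware write follows, all from the
         * single .port_mdb_add callback. */
        static int foo_port_mdb_add(struct dsa_switch *ds, int port,
                                    const struct switchdev_obj_port_mdb *mdb)
        {
                int err;
        
                err = foo_port_mdb_prepare(ds, port, mdb); /* old prepare phase */
                if (err)
                        return err;
        
                return foo_port_mdb_write(ds, port, mdb);  /* old commit phase */
        }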
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove the transaction structure from port attributes · bae33f2b
      Authored by Vladimir Oltean
      Since the introduction of the switchdev API, port attributes were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      attribute notifier structures, and converts drivers to not look at this
      member.
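      
      A hedged sketch of a converted attr_set handler, using an invented
      "foo" driver (foo_* names are assumed): the handler receives no
      switchdev_trans and simply returns the error of the hardware write.
      
        #include <net/switchdev.h>
        
        /* Hedged sketch: no prepare/commit split, no trans argument;
         * any failure is reported directly. */
        static int foo_port_attr_set(struct net_device *dev,
                                     const struct switchdev_attr *attr)
        {
                switch (attr->id) {
                case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
                        return foo_set_stp_state(dev, attr->u.stp_state);
                default:
                        return -EOPNOTSUPP;
                }
        }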
      
      In part, this patch contains a revert of my previous commit 2e554a7a
      ("net: dsa: propagate switchdev vlan_filtering prepare phase to
      drivers").
      
      For the most part, the conversion was trivial except for:
      - Rocker's world implementation based on Broadcom OF-DPA had an odd
        implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
        done mechanically, by pasting the implementation twice, then only
        keeping the code that would get executed during prepare phase on top,
        then only keeping the code that gets executed during the commit phase
        on bottom, then simplifying the resulting code until this was obtained.
      - DSA's offloading of STP state, bridge flags, VLAN filtering and
        multicast router could be converted right away. But the ageing time
        could not, so a shim was introduced and this was left for a further
        commit.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove the transaction structure from port object notifiers · ffb68fc5
      Authored by Vladimir Oltean
      Since the introduction of the switchdev API, port objects were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      object notifier structures, and converts drivers to not look at this
      member.
      
      Where driver conversion is trivial (like in the case of the Marvell
      Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
      done in this patch.
      
      Where driver conversion needs more attention (DSA, Mellanox Spectrum),
      the conversion is left for subsequent patches and here we only fake the
      prepare/commit phases at a lower level, just not in the switchdev
      notifier itself.
      
      Where the code has a natural structure that is best left alone as a
      preparation and a commit phase (as in the case of the Ocelot switch),
      that structure is left in place, just made to not depend upon the
      switchdev transactional model.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove vid_begin -> vid_end range from VLAN objects · b7a9e0da
      Authored by Vladimir Oltean
      The call path of a switchdev VLAN addition to the bridge looks something
      like this today:
      
              nbp_vlan_init
              |  __br_vlan_set_default_pvid
              |  |                       |
              |  |    br_afspec          |
              |  |        |              |
              |  |        v              |
              |  | br_process_vlan_info  |
              |  |        |              |
              |  |        v              |
              |  |   br_vlan_info        |
              |  |       / \            /
              |  |      /   \          /
              |  |     /     \        /
              |  |    /       \      /
              v  v   v         v    v
            nbp_vlan_add   br_vlan_add ------+
             |              ^      ^ |       |
             |             /       | |       |
             |            /       /  /       |
             \ br_vlan_get_master/  /        v
              \        ^        /  /  br_vlan_add_existing
               \       |       /  /          |
                \      |      /  /          /
                 \     |     /  /          /
                  \    |    /  /          /
                   \   |   /  /          /
                    v  |   | v          /
                    __vlan_add         /
                       / |            /
                      /  |           /
                     v   |          /
         __vlan_vid_add  |         /
                     \   |        /
                      v  v        v
            br_switchdev_port_vlan_add
      
      The ranges UAPI was introduced to the bridge in commit bdced7ef
      ("bridge: support for multiple vlans and vlan ranges in setlink and
      dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
      have always been passed one by one, through struct bridge_vlan_info
      tmp_vinfo, to br_vlan_info. So the range never went too far in depth.
      
      Then Scott Feldman introduced the switchdev_port_bridge_setlink function
      in commit 47f8328b ("switchdev: add new switchdev bridge setlink").
      That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
      full use of the range. But switchdev_port_bridge_setlink was called like
      this:
      
      br_setlink
      -> br_afspec
      -> switchdev_port_bridge_setlink
      
      Basically, the switchdev and the bridge code were not tightly integrated.
      Then commit 41c498b9 ("bridge: restore br_setlink back to original")
      came, and switchdev drivers were required to implement
      .ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.
      
      In the meantime, commits such as 0944d6b5 ("bridge: try switchdev op
      first in __vlan_vid_add/del") finally made switchdev penetrate the
      br_vlan_info() barrier and start to develop the call path we have today.
      But remember, br_vlan_info() still receives VLANs one by one.
      
      Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
      29ab586c ("net: switchdev: Remove bridge bypass support from
      switchdev") so that drivers would not implement .ndo_bridge_setlink any
      longer. The switchdev_port_bridge_setlink also got deleted.
      This refactoring removed the parallel bridge_setlink implementation from
      switchdev, and left the only switchdev VLAN objects to be the ones
      offloaded from __vlan_vid_add (basically RX filtering) and  __vlan_add
      (the latter coming from commit 9c86ce2c ("net: bridge: Notify about
      bridge VLANs")).
      
      That is to say, today the switchdev VLAN object ranges are not used in
      the kernel. Refactoring the above call path is a bit complicated, when
      the bridge VLAN call path is already a bit complicated.
      
      Let's go off and finish the job of commit 29ab586c by deleting the
      bogus iteration through the VLAN ranges from the drivers. Some aspects
      of this feature never made too much sense in the first place. For
      example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
      flag supposed to mean, when a port can obviously have a single pvid?
      This particular configuration _is_ denied as of commit 6623c60d
      ("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
      perspective, the driver still has to play pretend, and only offload the
      vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
      modify the flags of another, completely unrelated, switchdev VLAN
      object! (a VLAN that is PVID will invalidate the PVID flag from whatever
      other VLAN had previously been offloaded with switchdev and had that
      flag. Yet switchdev never notifies about that change, drivers are
      supposed to guess).
      
      Nonetheless, having a VLAN range in the API makes error handling
      look scarier than it really is (unwinding on errors and all of
      that), when in reality no one calls this API with more than one
      VLAN. It is all unnecessary complexity.
      
      And despite appearing pretentious (two-phase transactional model and
      all), the switchdev API is really sloppy because the VLAN addition and
      removal operations are not paired with one another (you can add a VLAN
      100 times and delete it just once). The bridge notifies through
      switchdev of a VLAN addition not only when the flags of an existing VLAN
      change, but also when nothing changes. There are switchdev drivers out
      there who don't like adding a VLAN that has already been added, and
      those checks don't really belong at driver level. But the fact that the
      API contains ranges is yet another factor that prevents this from being
      addressed in the future.
      
      Of the existing switchdev pieces of hardware, it appears that only
      Mellanox Spectrum supports offloading more than one VLAN at a time,
      through mlxsw_sp_port_vlan_set. I have kept that code internal to the
      driver, because there is some more bookkeeping that makes use of it, but
      I deleted it from the switchdev API. But since the switchdev support for
      ranges has already been de facto deleted by a Mellanox employee and
      nobody noticed for 4 years, I'm going to assume it's not a biggie.
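      
      A hedged sketch of a converted handler, using an invented "foo"
      driver (foo_vlan_write is assumed): with the range gone, the
      notifier carries exactly one VLAN, and the old
      for (vid = vid_begin; vid <= vid_end; vid++) loop disappears.
      
        #include <linux/if_bridge.h>
        #include <net/dsa.h>
        
        /* Hedged sketch: one notification, one vid, one write. */
        static int foo_port_vlan_add(struct dsa_switch *ds, int port,
                                     const struct switchdev_obj_port_vlan *vlan)
        {
                bool pvid = vlan->flags & BRIDGE_VLAN_INFO_PVID;
        
                return foo_vlan_write(ds, port, vlan->vid, pvid);
        }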
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  8. 09 Jan 2021, 3 commits
  9. 08 Jan 2021, 5 commits
    • net: dsa: remove the DSA specific notifiers · 1dbb1302
      Authored by Vladimir Oltean
      This effectively reverts commit 60724d4b ("net: dsa: Add support for
      DSA specific notifiers"). The reason is that since commit 2f1e8ea7
      ("net: dsa: link interfaces with the DSA master to get rid of lockdep
      warnings"), it appears that there is a generic way to achieve the same
      purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
      converted to use the generic notifiers.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: export dsa_slave_dev_check · a5e3c9ba
      Authored by Vladimir Oltean
      Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
      they are enslaved to e.g. a bridge by calling netif_is_bridge_master().
      
      Export this helper from DSA to provide the equivalent
      functionality: determining whether the upper interface in a
      CHANGEUPPER notification is a DSA switch interface or not.
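      
      A hedged sketch of how a driver might use the exported helper in
      its netdevice notifier (foo_handle_dsa_upper is assumed):
      
        #include <linux/netdevice.h>
        #include <net/dsa.h>
        
        /* Hedged sketch: recognize a DSA upper on CHANGEUPPER. */
        static int foo_netdevice_event(struct notifier_block *nb,
                                       unsigned long event, void *ptr)
        {
                struct netdev_notifier_changeupper_info *info = ptr;
        
                if (event == NETDEV_CHANGEUPPER &&
                    dsa_slave_dev_check(info->upper_dev))
                        return foo_handle_dsa_upper(info);
        
                return NOTIFY_DONE;
        }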
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: move the Broadcom tag information in a separate header file · f46b9b8e
      Authored by Vladimir Oltean
      It is a bit strange to see something as specific as Broadcom SYSTEMPORT
      bits in the main DSA include file. Move these away into a separate
      header, and have the tagger and the SYSTEMPORT driver include them.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors · d5f19486
      Authored by Vladimir Oltean
      Some DSA switches (and not only those) cannot learn source MAC
      addresses from packets injected from the CPU. They only perform
      hardware address learning from inbound traffic.
      
      This can be problematic when we have a bridge spanning some DSA switch
      ports and some non-DSA ports (which we'll call "foreign interfaces" from
      DSA's perspective).
      
      There are 2 classes of problems created by the lack of learning on
      CPU-injected traffic:
      - excessive flooding, due to the fact that DSA treats those addresses as
        unknown
      - the risk of stale routes, which can lead to temporary packet loss
      
      To illustrate the second class, consider the following situation, which
      is common in production equipment (wireless access points, where there
      is a WLAN interface and an Ethernet switch, and these form a single
      bridging domain).
      
       AP 1:
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
             |                                                       ^        ^
             |                                                       |        |
             |                                                       |        |
             |                                                    Client A  Client B
             |
             |
             |
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       AP 2
      
      - br0 of AP 1 will know that Clients A and B are reachable via wlan0
      - the hardware fdb of a DSA switch driver today is not kept in sync with
        the software entries on other bridge ports, so it will not know that
        clients A and B are reachable via the CPU port UNLESS the hardware
        switch itself performs SA learning from traffic injected from the CPU.
        Nonetheless, a substantial number of switches don't.
      - the hardware fdb of the DSA switch on AP 2 may autonomously learn that
        Client A and B are reachable through swp0. Therefore, the software br0
        of AP 2 also may or may not learn this. In the example we're
        illustrating, some Ethernet traffic has been going on, and br0 from AP
        2 has indeed learnt that it can reach Client B through swp0.
      
      One of the wireless clients, say Client B, disconnects from AP 1 and
      roams to AP 2. The topology now looks like this:
      
       AP 1:
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
             |                                                            ^
             |                                                            |
             |                                                         Client A
             |
             |
             |                                                         Client B
             |                                                            |
             |                                                            v
       +------------+ +------------+ +------------+ +------------+ +------------+
       |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
       +------------+ +------------+ +------------+ +------------+ +------------+
       +------------------------------------------------------------------------+
       |                                          br0                           |
       +------------------------------------------------------------------------+
       AP 2
      
      - br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
      - br0 of AP 1 will (possibly) know that Client B has left wlan0. There
        are cases where it might never find out though. Either way, DSA today
        does not process that notification in any way.
      - the hardware FDB of the DSA switch on AP 1 may learn autonomously that
        Client B can be reached via swp0, if it receives any packet with
        Client B's source MAC address over Ethernet.
      - the hardware FDB of the DSA switch on AP 2 still thinks that Client B
        can be reached via swp0. It does not know that it has roamed to wlan0,
        because it doesn't perform SA learning from the CPU port.
      
      Now Client A contacts Client B.
      AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
      segment.
      AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
      Hairpinning is disabled => drop.
      
      This problem comes from the fact that these switches have a 'blind spot'
      for addresses coming from software bridging. The generic solution is not
      to assume that hardware learning can be enabled somehow, but to listen
      to more bridge learning events. It turns out that the bridge driver does
      learn in software from all inbound frames, in __br_handle_local_finish.
      A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
      addresses serviced by the bridge on 'foreign' interfaces. The software
      bridge also does the right thing on migration, by notifying that the old
      entry is deleted, so that does not need to be special-cased in DSA. When
      it is deleted, we just need to delete our static FDB entry towards the
      CPU too, and wait.
      
      The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
      events received on its own interfaces, such as static FDB entries.
      
      Luckily we can change that, and DSA can listen to all switchdev FDB
      add/del events in the system and figure out if those events were emitted
      by a bridge that spans at least one of DSA's own ports. In case that is
      true, DSA will also offload that address towards its own CPU port, in
      the eventuality that there might be bridge clients attached to the DSA
      switch who want to talk to the station connected to the foreign
      interface.
      
      In terms of implementation, we need to keep the fdb_info->added_by_user
      check for the case where the switchdev event was targeted directly at a
      DSA switch port. But we don't need to look at that flag for snooped
      events. So the check is currently too late, we need to move it earlier.
      This also simplifies the code a bit, since we avoid uselessly allocating
      and freeing switchdev_work.
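      
      A hedged sketch of that snooping decision (foo_* helpers are
      assumed, not the actual DSA code):
      
        #include <linux/netdevice.h>
        #include <net/dsa.h>
        #include <net/switchdev.h>
        
        /* Hedged sketch: events targeting a DSA port keep the
         * added_by_user check; events snooped from foreign interfaces
         * matter only if their bridge also spans one of our ports. */
        static bool foo_want_fdb_event(struct net_device *dev,
                                       const struct switchdev_notifier_fdb_info *fdb)
        {
                if (dsa_slave_dev_check(dev))
                        return fdb->added_by_user;      /* direct target */
        
                return foo_bridge_spans_dsa_port(netdev_master_upper_dev_get(dev));
        }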
      
      We could probably do some improvements in the future. For example,
      multi-bridge support is rudimentary at the moment. If there are two
      bridges spanning a DSA switch's ports, and both of them need to service
      the same MAC address, then what will happen is that the migration of one
      of those stations will trigger the deletion of the FDB entry from the
      CPU port while it is still used by other bridge. That could be improved
      with reference counting but is left for another time.
      
      This behavior needs to be enabled at driver level by setting
      ds->assisted_learning_on_cpu_port = true. This is because we don't want
      to inflict a potential performance penalty (accesses through
      MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
      because address learning on the CPU port works there.
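      
      A minimal sketch of the opt-in, using an invented "foo" driver;
      the flag itself is the one this patch adds, foo_setup is assumed:
      
        #include <net/dsa.h>
        
        /* Hedged sketch: hardware that cannot learn from CPU-injected
         * traffic opts in during switch setup. */
        static int foo_setup(struct dsa_switch *ds)
        {
                ds->assisted_learning_on_cpu_port = true;
                return 0;
        }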
      Reported-by: DENG Qingfang <dqfext@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checks · b9ef3fec
      Authored by Jakub Kicinski
      Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
      udp_tunnel_nic_*_port() helpers, since they're always
      done right before the call.
      
      Add similar checks before calling the notifier:
      udp_tunnel_nic invokes the notifier without checking features,
      which could result in some wasted cycles.
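      
      A hedged sketch of the reshuffled helper; the inner
      __udp_tunnel_nic_add_port call is assumed:
      
        #include <net/udp_tunnel.h>
        
        /* Hedged sketch: the feature test lives inside the helper, so
         * callers stop open-coding it right before the call. */
        static void udp_tunnel_nic_add_port(struct net_device *dev,
                                            struct udp_tunnel_info *ti)
        {
                if (!(dev->features & NETIF_F_RX_UDP_TUNNEL_PORT))
                        return;
        
                __udp_tunnel_nic_add_port(dev, ti);
        }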
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  10. 29 Dec 2020, 1 commit
    • net: sched: prevent invalid Scell_log shift count · bd1248f1
      Authored by Randy Dunlap
      Check Scell_log shift size in red_check_params() and modify all callers
      of red_check_params() to pass Scell_log.
      
      This prevents a shift out-of-bounds as detected by UBSAN:
        UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
        shift exponent 72 is too large for 32-bit type 'int'
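      
      A hedged sketch of the added guard (the real red_check_params()
      validates more than shown here):
      
        #include <linux/types.h>
        
        /* Hedged sketch: (1 << Scell_log) is evaluated on a 32-bit int,
         * so any exponent of 32 or more is undefined behaviour; reject
         * it up front. */
        static bool scell_log_valid(u8 Scell_log)
        {
                return Scell_log < 32;
        }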
      
      Fixes: 8afa10cb ("net_sched: red: Avoid illegal values")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
      Cc: Nogah Frankel <nogahf@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 18 Dec 2020, 1 commit
  12. 15 Dec 2020, 3 commits
  13. 13 Dec 2020, 2 commits
    • inet: frags: batch fqdir destroy works · 0b9b2414
      Authored by SeongJae Park
      On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
      make the number of active slab objects including 'sock_inode_cache' type
      rapidly and continuously increase.  As a result, memory pressure occurs.
      
      In more detail, I made an artificial reproducer that resembles
      the workload in which we found the problem and reproduces it
      faster.  It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in
      a loop.  It takes about 2 minutes.  On a 40 CPU cores / 70GB DRAM
      machine, the available memory continuously dropped at a fast rate
      (about 120MB per second, 15GB in total within the 2 minutes).
      Note that the issue doesn't reproduce on every machine.  On my
      6 CPU cores machine, the problem didn't reproduce.
      
      'cleanup_net()' and 'fqdir_work_fn()' are the functions that
      deallocate the relevant memory objects.  They are asynchronously
      invoked by the work queues and internally use 'rcu_barrier()' to
      ensure safe destruction.  'cleanup_net()' works in a batched
      manner in a single-threaded worker, while 'fqdir_work_fn()' runs
      for each 'fqdir_exit()' call in the 'system_wq'.  Therefore,
      'fqdir_work_fn()' was called frequently under the workload and
      made the contention on 'rcu_barrier()' high.  In more detail, the
      global mutex 'rcu_state.barrier_mutex' became the bottleneck.
      
      This commit avoids such contention by doing the 'rcu_barrier()'
      and the subsequent lightweight work in a batched manner, similar
      to 'cleanup_net()'.  The fqdir hashtable destruction, which is
      done before the 'rcu_barrier()', is still allowed to run in
      parallel for fast processing, but this commit makes it use a
      dedicated work queue instead of the 'system_wq', to make sure that
      the number of threads is bounded.
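      
      A hedged sketch of the batching idea, assuming a free_list member
      added to struct fqdir (names are illustrative, not the actual
      patch):
      
        #include <linux/list.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>
        #include <linux/spinlock.h>
        #include <linux/workqueue.h>
        #include <net/inet_frag.h>
        
        /* Hedged sketch: every fqdir queued in the batch shares one
         * rcu_barrier() instead of paying for its own. */
        static LIST_HEAD(fqdir_free_list);
        static DEFINE_SPINLOCK(fqdir_free_lock);
        
        static void fqdir_free_fn(struct work_struct *work)
        {
                struct fqdir *f, *tmp;
                LIST_HEAD(batch);
        
                spin_lock(&fqdir_free_lock);
                list_splice_init(&fqdir_free_list, &batch);
                spin_unlock(&fqdir_free_lock);
        
                rcu_barrier();          /* once for the whole batch */
        
                list_for_each_entry_safe(f, tmp, &batch, free_list)
                        kfree(f);
        }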
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: nftables: generalize set extension to support for several expressions · 563125a7
      Authored by Pablo Neira Ayuso
      This patch replaces NFT_SET_EXPR with NFT_SET_EXT_EXPRESSIONS.
      This new extension allows attaching several expressions to one set
      element (rather than the single expression NFT_SET_EXPR provides).
      It prepares for support for several expressions per set element in
      the netlink userspace API.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  14. 12 Dec 2020, 3 commits
  15. 11 Dec 2020, 3 commits