1. 16 Jan, 2021 4 commits
    • net: dsa: add ops for devlink-sb · 2a6ef763
      Committed by Vladimir Oltean
      Switches that care about QoS might have hardware support for reserving
      buffer pools for individual ports or traffic classes, and configuring
      their sizes and thresholds. Through devlink-sb (shared buffers), this is
      all configurable, as well as their occupancy being viewable.
      
      Add the plumbing in DSA for these operations.
      
      Individual drivers still need to call devlink_sb_register() with the
      shared buffers they want to expose. A helper was not created in DSA for
      this purpose (unlike, say, dsa_devlink_params_register), since in my
      opinion it does not bring any benefit over plainly calling
      devlink_sb_register() directly.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • neighbor: remove definition of DEBUG · e794e7fa
      Committed by Tom Rix
      Defining DEBUG should only be done in development.
      So remove DEBUG.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Link: https://lore.kernel.org/r/20210114212917.48174-1-trix@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: set configure_vlan_while_not_filtering to true by default · 0ee2af4e
      Committed by Vladimir Oltean
      As explained in commit 54a0ed0d ("net: dsa: provide an option for
      drivers to always receive bridge VLANs"), DSA has historically been
      skipping VLAN switchdev operations when the bridge wasn't in
      vlan_filtering mode, but the reason why it was doing that has never been
      clear. So the configure_vlan_while_not_filtering option is there merely
      to preserve functionality for existing drivers. It isn't some behavior
      that drivers should opt into. Ideally, when all drivers leave this flag
      set, we can delete the dsa_port_skip_vlan_configuration() function.
      
      New drivers always seem to omit setting this flag, for some reason. So
      let's reverse the logic: the DSA core sets it by default to true before
      the .setup() callback, and legacy drivers can turn it off. This way, new
      drivers get the new behavior by default, unless they explicitly set the
      flag to false, which is more obvious during review.
      
      Remove the assignment from drivers which were setting it to true, and
      add the assignment to false for the drivers that didn't previously have
      it. This way, it should be easier to see how many we have left.
      
      The following drivers: lan9303, mv88e6060 were skipped from setting this
      flag to false, because they didn't have any VLAN offload ops in the
      first place.
      
      The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and
      therefore also inherits the configure_vlan_while_not_filtering=true
      behavior.
      
      Also, print a message through netlink extack every time a VLAN has been
      skipped. This is mildly annoying on purpose, so that (a) it is at least
      clear that VLANs are being skipped - the legacy behavior in itself is
      confusing, and the extack should be much more difficult to miss, unlike
      kernel logs - and (b) people have one more incentive to convert to the
      new behavior.
      
      No behavior change except for the added prints is intended at this time.
      
      $ ip link add br0 type bridge vlan_filtering 0
      $ ip link set sw0p2 master br0
      [   60.315148] br0: port 1(sw0p2) entered blocking state
      [   60.320350] br0: port 1(sw0p2) entered disabled state
      [   60.327839] device sw0p2 entered promiscuous mode
      [   60.334905] br0: port 1(sw0p2) entered blocking state
      [   60.340142] br0: port 1(sw0p2) entered forwarding state
      Warning: dsa_core: skipping configuration of VLAN. # This was the pvid
      $ bridge vlan add dev sw0p2 vid 100
      Warning: dsa_core: skipping configuration of VLAN.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
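The default-on flag and the skip decision described in this commit can be modeled in a few lines of C. This is a minimal userspace sketch, not the DSA code: `struct switch_model`, `core_defaults()` and `skip_vlan_configuration()` are hypothetical names, and the skip condition is an assumption drawn from the description (a VLAN is skipped only when the bridge is not VLAN-filtering and the driver has turned the flag off).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-switch state the commit talks about. */
struct switch_model {
	bool configure_vlan_while_not_filtering;
};

/* The core sets the flag before the driver's setup hook would run. */
static void core_defaults(struct switch_model *sw)
{
	sw->configure_vlan_while_not_filtering = true;
}

/* A legacy driver opts out explicitly, which is easy to spot in review. */
static void legacy_driver_setup(struct switch_model *sw)
{
	sw->configure_vlan_while_not_filtering = false;
}

/* Assumed skip rule: only legacy drivers on a non-filtering bridge skip. */
static bool skip_vlan_configuration(const struct switch_model *sw,
				    bool bridge_vlan_filtering)
{
	return !bridge_vlan_filtering &&
	       !sw->configure_vlan_while_not_filtering;
}
```

With this shape, a new driver that does nothing gets the new behavior, and the number of remaining opt-outs can be found by searching for assignments to false.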
    • dsa: add support for Arrow XRS700x tag trailer · 54a52823
      Committed by George McCollister
      Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
      is modeled on tag_trailer.c which works in a similar way.
      Signed-off-by: George McCollister <george.mccollister@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 15 Jan, 2021 11 commits
  3. 14 Jan, 2021 5 commits
  4. 13 Jan, 2021 10 commits
    • net/smc: use memcpy instead of snprintf to avoid out of bounds read · 8a446536
      Committed by Guvenc Gulce
      Using snprintf() to convert strings that are not null-terminated into
      null-terminated strings may cause an out-of-bounds read in the source
      string. Therefore use memcpy() and terminate the target string with a
      null afterwards.
      
      Fixes: a3db10ef ("net/smc: Add support for obtaining SMCR device list")
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
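The memcpy-then-terminate pattern from this commit can be illustrated in userspace. This is a minimal sketch under stated assumptions: `FIELD_LEN` and `field_to_cstr()` are hypothetical and not taken from the SMC code. The point is that memcpy() reads exactly the field width, whereas a `%s` conversion keeps reading the source until it finds a terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed field width; the real SMC field sizes differ. */
#define FIELD_LEN 8

/*
 * Copy a fixed-width field that is NOT null-terminated into dst and
 * terminate dst. dst must have room for FIELD_LEN + 1 bytes.
 * Exactly FIELD_LEN bytes are read from src, never more.
 */
static void field_to_cstr(char *dst, const char *src)
{
	memcpy(dst, src, FIELD_LEN);
	dst[FIELD_LEN] = '\0';
}
```

A caller would pass a destination buffer declared as `char out[FIELD_LEN + 1]`.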
    • smc: fix out of bound access in smc_nl_get_sys_info() · 25fe2c9c
      Committed by Jakub Kicinski
      smc_clc_get_hostname() sets the host pointer to a buffer
      which is not NULL-terminated (see smc_clc_init()).
      
      Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com
      Fixes: 099b990b ("net/smc: Add support for obtaining system information")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • hci: llc_shdlc: style: Simplify bool comparison · f50e2f9f
      Committed by YANG LI
      Fix the following coccicheck warning:
      ./net/nfc/hci/llc_shdlc.c:239:5-21: WARNING: Comparison to bool
      Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Link: https://lore.kernel.org/r/1610357063-57705-1-git-send-email-abaci-bugfix@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: add optional stats64 support · c2ec5f2e
      Committed by Oleksij Rempel
      Allow DSA drivers to export stats64
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: better msk-level shutdown. · 76e2a55d
      Committed by Paolo Abeni
      Instead of re-implementing most of inet_shutdown, re-use
      such helper, and implement the MPTCP-specific bits at the
      'proto' level.
      
      The msk-level disconnect() can now be invoked, so let's provide a
      suitable implementation.
      
      As a side effect, this fixes bad state management for listener
      sockets. The latter could lead to division by 0 oops since
      commit ea4ca586 ("mptcp: refine MPTCP-level ack scheduling").
      
      Fixes: 43b54c6e ("mptcp: Use full MPTCP-level disconnect state machine")
      Fixes: ea4ca586 ("mptcp: refine MPTCP-level ack scheduling")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: more strict state checking for acks · 20bc80b6
      Committed by Paolo Abeni
      Syzkaller found a way to trigger division by zero
      in mptcp_subflow_cleanup_rbuf().
      
      The current checks implemented in tcp_can_send_ack()
      are too weak, let's be more accurate.
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Fixes: ea4ca586 ("mptcp: refine MPTCP-level ack scheduling")
      Fixes: fd897679 ("mptcp: be careful on MPTCP-level ack.")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: clear devlink port type before unregistering slave netdevs · 91158e16
      Committed by Vladimir Oltean
      Florian reported a use-after-free bug in devlink_nl_port_fill found with
      KASAN:
      
      (devlink_nl_port_fill)
      (devlink_port_notify)
      (devlink_port_unregister)
      (dsa_switch_teardown.part.3)
      (dsa_tree_teardown_switches)
      (dsa_unregister_switch)
      (bcm_sf2_sw_remove)
      (platform_remove)
      (device_release_driver_internal)
      (device_links_unbind_consumers)
      (device_release_driver_internal)
      (device_driver_detach)
      (unbind_store)
      
      Allocated by task 31:
       alloc_netdev_mqs+0x5c/0x50c
       dsa_slave_create+0x110/0x9c8
       dsa_register_switch+0xdb0/0x13a4
       b53_switch_register+0x47c/0x6dc
       bcm_sf2_sw_probe+0xaa4/0xc98
       platform_probe+0x90/0xf4
       really_probe+0x184/0x728
       driver_probe_device+0xa4/0x278
       __device_attach_driver+0xe8/0x148
       bus_for_each_drv+0x108/0x158
      
      Freed by task 249:
       free_netdev+0x170/0x194
       dsa_slave_destroy+0xac/0xb0
       dsa_port_teardown.part.2+0xa0/0xb4
       dsa_tree_teardown_switches+0x50/0xc4
       dsa_unregister_switch+0x124/0x250
       bcm_sf2_sw_remove+0x98/0x13c
       platform_remove+0x44/0x5c
       device_release_driver_internal+0x150/0x254
       device_links_unbind_consumers+0xf8/0x12c
       device_release_driver_internal+0x84/0x254
       device_driver_detach+0x30/0x34
       unbind_store+0x90/0x134
      
      What happens is that devlink_port_unregister emits a netlink
      DEVLINK_CMD_PORT_DEL message which associates the devlink port that is
      getting unregistered with the ifindex of its corresponding net_device.
      Only trouble is, the net_device has already been unregistered.
      
      It looks like we can stub out the search for a corresponding net_device
      if we clear the devlink_port's type. This looks like a bit of a hack,
      but also seems to be the reason why the devlink_port_type_clear function
      exists in the first place.
      
      Fixes: 3122433e ("net: dsa: Register devlink ports before calling DSA driver setup()")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Florian Fainelli <f.fainelli@gmail.com>
      Reported-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: unbind all switches from tree when DSA master unbinds · 07b90056
      Committed by Vladimir Oltean
      Currently the following happens when a DSA master driver unbinds while
      there are DSA switches attached to it:
      
      $ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507
      Call trace:
       rollback_registered_many+0x5fc/0x688
       unregister_netdevice_queue+0x98/0x120
       dsa_slave_destroy+0x4c/0x88
       dsa_port_teardown.part.16+0x78/0xb0
       dsa_tree_teardown_switches+0x58/0xc0
       dsa_unregister_switch+0x104/0x1b8
       felix_pci_remove+0x24/0x48
       pci_device_remove+0x48/0xf0
       device_release_driver_internal+0x118/0x1e8
       device_driver_detach+0x28/0x38
       unbind_store+0xd0/0x100
      
      Located at the above location is this WARN_ON:
      
      	/* Notifier chain MUST detach us all upper devices. */
      	WARN_ON(netdev_has_any_upper_dev(dev));
      
      Other stacked interfaces, like VLAN, do indeed listen for
      NETDEV_UNREGISTER on the real_dev and also unregister themselves at that
      time, which is clearly the behavior that rollback_registered_many
      expects. But DSA interfaces are not VLAN. They have backing hardware
      (platform devices, PCI devices, MDIO, SPI etc) which have a life cycle
      of their own and we can't just trigger an unregister from the DSA
      framework when we receive a netdev notifier that the master unregisters.
      
      Luckily, there is something we can do, and that is to inform the driver
      core that we have a runtime dependency to the DSA master interface's
      device, and create a device link where that is the supplier and we are
      the consumer. Having this device link will make the DSA switch unbind
      before the DSA master unbinds, which is enough to avoid the WARN_ON from
      rollback_registered_many.
      
      Note that even before the blamed commit, DSA did nothing intelligent
      when the master interface got unregistered either. See the discussion
      here:
      https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/
      But this time, at least the WARN_ON is loud enough that the
      upper_dev_link commit can be blamed.
      
      The advantage with this approach vs dev_hold(master) in the attached
      link is that the latter is not meant for long term reference counting.
      With dev_hold, the only thing that will happen is that when the user
      attempts an unbind of the DSA master, netdev_wait_allrefs will keep
      waiting and waiting, due to DSA keeping the refcount forever. DSA would
      not access freed memory corresponding to the master interface, but the
      unbind would still result in a freeze. Whereas with device links,
      graceful teardown is ensured. It even works with cascaded DSA trees.
      
      $ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
      [ 1818.797546] device swp0 left promiscuous mode
      [ 1819.301112] sja1105 spi2.0: Link is Down
      [ 1819.307981] DSA: tree 1 torn down
      [ 1819.312408] device eno2 left promiscuous mode
      [ 1819.656803] mscc_felix 0000:00:00.5: Link is Down
      [ 1819.667194] DSA: tree 0 torn down
      [ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down
      
      This approach allows us to keep the DSA framework absolutely unchanged,
      and the driver core will just know to unbind us first when the master
      goes away - as opposed to the large (and probably impossible) rework
      required if attempting to listen for NETDEV_UNREGISTER.
      
      As per the documentation at Documentation/driver-api/device_link.rst,
      specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link
      to be automatically purged when the consumer fails to probe or later
      unbinds. So we don't need to keep the consumer_link variable in struct
      dsa_switch.
      
      Fixes: 2f1e8ea7 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands · df85bc14
      Committed by Petr Machata
      In commit 826f328e ("net: dcb: Validate netlink message in DCB
      handler"), Linux started rejecting RTM_GETDCB netlink messages if they
      contained a set-like DCB_CMD_ command.
      
      The reason was that privileges were only verified for RTM_SETDCB messages,
      but the value that determined the action to be taken is the command, not
      the message type. And validation of message type against the DCB command
      was the obvious missing piece.
      
      Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
      for configuration of DCB, accesses the DCB set-like APIs through
      RTM_GETDCB.
      
      Therefore do not bounce the discrepancy between message type and command.
      Instead, in addition to validating privileges based on the actual message
      type, validate them also based on the expected message type. This closes
      the loophole of allowing DCB configuration on non-admin accounts, while
      maintaining backward compatibility.
      
      Fixes: 2f90b865 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
      Fixes: 826f328e ("net: dcb: Validate netlink message in DCB handler")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
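The double privilege check described in this commit can be sketched as a small decision function. This is a simplified userspace model, not the DCB handler: `enum msg_type` stands in for RTM_GETDCB and RTM_SETDCB, `is_admin` stands in for the capability check, and `request_allowed()` is a hypothetical name.

```c
#include <stdbool.h>

/* Stand-ins for the RTM_GETDCB and RTM_SETDCB message types. */
enum msg_type { MSG_GET, MSG_SET };

/*
 * actual:   message type the sender used.
 * expected: message type that the carried command normally belongs to.
 * A mismatch between the two is tolerated, but a set-like request on
 * either axis requires admin privileges.
 */
static bool request_allowed(enum msg_type actual, enum msg_type expected,
			    bool is_admin)
{
	if (actual == MSG_SET && !is_admin)
		return false;
	if (expected == MSG_SET && !is_admin)
		return false;
	return true;
}
```

In this model, an admin tool that sends a set-like command inside a get-type message keeps working, while a non-admin sender using the same trick is refused.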
    • bpf: Allow to retrieve sol_socket opts from sock_addr progs · bcd6f4a8
      Committed by Daniel Borkmann
      The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options,
      however, _bpf_getsockopt() has little support to actually retrieve them.
      This small patch adds a few misc options such as SO_MARK, SO_PRIORITY and
      SO_BINDTOIFINDEX. For the latter getter and setter are added. The mark and
      priority in particular allow to retrieve the options from BPF cgroup hooks
      to then implement custom behavior / settings on the syscall hooks compared
      to other sockets that stick to the defaults, for example.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/cba44439b801e5ddc1170e5be787f4dc93a2d7f9.1610406333.git.daniel@iogearbox.net
  5. 12 Jan, 2021 10 commits
    • esp: avoid unneeded kmap_atomic call · 9bd6b629
      Committed by Willem de Bruijn
      esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
      the esp trailer.
      
      It accesses the page with kmap_atomic to handle highmem. But
      skb_page_frag_refill can return compound pages, of which
      kmap_atomic only maps the first underlying page.
      
      skb_page_frag_refill does not return highmem, because flag
      __GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
      That also does not call kmap_atomic, but directly uses page_address,
      in skb_copy_to_page_nocache. Do the same for ESP.
      
      This issue has become easier to trigger with recent kmap local
      debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: compound page support in skb_seq_read · 97550f6f
      Committed by Willem de Bruijn
      skb_seq_read iterates over an skb, returning pointer and length of
      the next data range with each call.
      
      It relies on kmap_atomic to access highmem pages when needed.
      
      An skb frag may be backed by a compound page, but kmap_atomic maps
      only a single page. There are not enough kmap slots to always map all
      pages concurrently.
      
      Instead, if kmap_atomic is needed, iterate over each page.
      
      As this increases the number of calls, avoid this unless needed.
      The necessary condition is captured in skb_frag_must_loop.
      
      I tried to make the change as obvious as possible. It should be easy
      to verify that nothing changes if skb_frag_must_loop returns false.
      
      Tested:
        On an x86 platform with
          CONFIG_HIGHMEM=y
          CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
          CONFIG_NETFILTER_XT_MATCH_STRING=y
      
        Run
          ip link set dev lo mtu 1500
          iptables -A OUTPUT -m string --string 'badstring' --algo bm -j ACCEPT
          dd if=/dev/urandom of=in bs=1M count=20
          nc -l -p 8000 > /dev/null &
          nc -w 1 -q 0 localhost 8000 < in
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
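The idea of walking a multi-page range one page at a time, instead of mapping it all at once, can be shown with plain offset arithmetic. This is a userspace sketch only: `PAGE_SZ`, `struct chunk` and `split_by_page()` are hypothetical, and no kernel mapping call is involved. Each chunk it produces is the portion a single-page mapping could cover.

```c
#include <stddef.h>

/* Assumed page size for the illustration. */
#define PAGE_SZ 4096u

/* One piece of the range that lies entirely within a single page. */
struct chunk {
	size_t page_idx;	/* index of the page inside the compound page */
	size_t offset;		/* offset within that page */
	size_t len;		/* bytes to process from that page */
};

/*
 * Split the byte range [off, off + len) into per-page chunks.
 * Writes at most max chunks to out and returns how many were written.
 */
static size_t split_by_page(size_t off, size_t len,
			    struct chunk *out, size_t max)
{
	size_t n = 0;

	while (len > 0 && n < max) {
		size_t in_page = off % PAGE_SZ;
		size_t take = PAGE_SZ - in_page;

		if (take > len)
			take = len;

		out[n].page_idx = off / PAGE_SZ;
		out[n].offset = in_page;
		out[n].len = take;
		n++;

		off += take;
		len -= take;
	}
	return n;
}
```

A range that fits in one page yields a single chunk, which mirrors the commit's goal of leaving the common case unchanged.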
    • net: dsa: remove obsolete comments about switchdev transactions · 417b99bf
      Committed by Vladimir Oltean
      Now that all port object notifiers were converted to be non-transactional,
      we can remove the comments that say otherwise.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: remove the transactional logic from VLAN objects · 1958d581
      Committed by Vladimir Oltean
      It should be the driver's business to logically separate its VLAN
      offloading into a preparation and a commit phase, and some drivers don't
      need / can't do this.
      
      So remove the transactional shim from DSA and let drivers propagate
      errors directly from the .port_vlan_add callback.
      
      It would appear that the code has worse error handling now than it had
      before. DSA is the only in-kernel user of switchdev that offloads one
      switchdev object to more than one port: for every VLAN object offloaded
      to a user port, that VLAN is also offloaded to the CPU port. So the
      "prepare for user port -> check for errors -> prepare for CPU port ->
      check for errors -> commit for user port -> commit for CPU port"
      sequence appears to make more sense than the one we are using now:
      "offload to user port -> check for errors -> offload to CPU port ->
      check for errors", but it is really a compromise. In the new way, we can
      catch errors from the commit phase that we previously had to ignore.
      But we have our hands tied and cannot do any rollback now: if we add a
      VLAN on the CPU port and it fails, we can't do the rollback by simply
      deleting it from the user port, because the switchdev API is not so nice
      with us: it could have simply been there already, even with the same
      flags. So we don't even attempt to rollback anything on addition error,
      just leave whatever VLANs managed to get offloaded right where they are.
      This should not be a problem at all in practice.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
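The "offload to user port -> check for errors -> offload to CPU port -> check for errors" sequence from this commit, with no rollback, can be sketched as follows. This is a userspace model with hypothetical names: `fake_port_vlan_add()` stands in for a driver's .port_vlan_add callback, and `failing_port` exists only to simulate a hardware error.

```c
#include <stdbool.h>

#define NUM_PORTS 4

/* Simulated hardware state: which ports have the VLAN installed. */
static bool offloaded[NUM_PORTS];
/* Port whose offload should fail, or -1 for none (test knob). */
static int failing_port = -1;

/* Hypothetical per-port hook that returns an error code directly. */
typedef int (*vlan_add_fn)(int port, unsigned short vid);

static int fake_port_vlan_add(int port, unsigned short vid)
{
	(void)vid;
	if (port == failing_port)
		return -1;
	offloaded[port] = true;
	return 0;
}

/*
 * Offload to the user port, then to the CPU port. An error from either
 * step is propagated. Nothing is rolled back: the VLAN may have been
 * present on the user port before this call, so deleting it could be wrong.
 */
static int vlan_add_user_and_cpu(vlan_add_fn add, int user_port,
				 int cpu_port, unsigned short vid)
{
	int err = add(user_port, vid);

	if (err)
		return err;
	return add(cpu_port, vid);
}
```

When the CPU port step fails, the user port keeps its VLAN, which matches the "leave whatever VLANs managed to get offloaded right where they are" trade-off.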
    • net: dsa: remove the transactional logic from MDB entries · a52b2da7
      Committed by Vladimir Oltean
      For many drivers, the .port_mdb_prepare callback was not a good opportunity
      to avoid any error condition, and they would suppress errors found during
      the actual commit phase.
      
      Where a logical separation between the prepare and the commit phase
      existed, the function that used to implement the .port_mdb_prepare
      callback still exists, but now it is called directly from .port_mdb_add,
      which was modified to return an int code.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: remove the transactional logic from ageing time notifiers · 77b61365
      Committed by Vladimir Oltean
      Remove the shim introduced in DSA for offloading the bridge ageing time
      from switchdev, by first checking whether the ageing time is within the
      range limits requested by the driver.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove the transaction structure from port attributes · bae33f2b
      Committed by Vladimir Oltean
      Since the introduction of the switchdev API, port attributes were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      attribute notifier structures, and converts drivers to not look at this
      member.
      
      In part, this patch contains a revert of my previous commit 2e554a7a
      ("net: dsa: propagate switchdev vlan_filtering prepare phase to
      drivers").
      
      For the most part, the conversion was trivial except for:
      - Rocker's world implementation based on Broadcom OF-DPA had an odd
        implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
        done mechanically, by pasting the implementation twice, then only
        keeping the code that would get executed during prepare phase on top,
        then only keeping the code that gets executed during the commit phase
        on bottom, then simplifying the resulting code until this was obtained.
      - DSA's offloading of STP state, bridge flags, VLAN filtering and
        multicast router could be converted right away. But the ageing time
        could not, so a shim was introduced and this was left for a further
        commit.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: delete switchdev_port_obj_add_now · cf6def51
      Committed by Vladimir Oltean
      After the removal of the transactional model inside
      switchdev_port_obj_add_now, it has no added value and we can just call
      switchdev_port_obj_notify directly, bypassing this function. Let's
      delete it.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove the transaction structure from port object notifiers · ffb68fc5
      Committed by Vladimir Oltean
      Since the introduction of the switchdev API, port objects were
      transmitted to drivers for offloading using a two-step transactional
      model, with a prepare phase that was supposed to catch all errors, and a
      commit phase that was supposed to never fail.
      
      Some classes of failures can never be avoided, like hardware access, or
      memory allocation. In the latter case, merely attempting to move the
      memory allocation to the preparation phase makes it impossible to avoid
      memory leaks, since commit 91cf8ece ("switchdev: Remove unused
      transaction item queue") which has removed the unused mechanism of
      passing on the allocated memory between one phase and another.
      
      It is time we admit that separating the preparation from the commit
      phase is something that is best left for the driver to decide, and not
      something that should be baked into the API, especially since there are
      no switchdev callers that depend on this.
      
      This patch removes the struct switchdev_trans member from switchdev port
      object notifier structures, and converts drivers to not look at this
      member.
      
      Where driver conversion is trivial (like in the case of the Marvell
      Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
      done in this patch.
      
      Where driver conversion needs more attention (DSA, Mellanox Spectrum),
      the conversion is left for subsequent patches and here we only fake the
      prepare/commit phases at a lower level, just not in the switchdev
      notifier itself.
      
      Where the code has a natural structure that is best left alone as a
      preparation and a commit phase (as in the case of the Ocelot switch),
      that structure is left in place, just made to not depend upon the
      switchdev transactional model.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove vid_begin -> vid_end range from VLAN objects · b7a9e0da
      Committed by Vladimir Oltean
      The call path of a switchdev VLAN addition to the bridge looks something
      like this today:
      
              nbp_vlan_init
              |  __br_vlan_set_default_pvid
              |  |                       |
              |  |    br_afspec          |
              |  |        |              |
              |  |        v              |
              |  | br_process_vlan_info  |
              |  |        |              |
              |  |        v              |
              |  |   br_vlan_info        |
              |  |       / \            /
              |  |      /   \          /
              |  |     /     \        /
              |  |    /       \      /
              v  v   v         v    v
            nbp_vlan_add   br_vlan_add ------+
             |              ^      ^ |       |
             |             /       | |       |
             |            /       /  /       |
             \ br_vlan_get_master/  /        v
              \        ^        /  /  br_vlan_add_existing
               \       |       /  /          |
                \      |      /  /          /
                 \     |     /  /          /
                  \    |    /  /          /
                   \   |   /  /          /
                    v  |   | v          /
                    __vlan_add         /
                       / |            /
                      /  |           /
                     v   |          /
         __vlan_vid_add  |         /
                     \   |        /
                      v  v        v
            br_switchdev_port_vlan_add
      
      The ranges UAPI was introduced to the bridge in commit bdced7ef
      ("bridge: support for multiple vlans and vlan ranges in setlink and
      dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
      have always been passed one by one, through struct bridge_vlan_info
      tmp_vinfo, to br_vlan_info. So the range never went too far in depth.
      
      Then Scott Feldman introduced the switchdev_port_bridge_setlink function
      in commit 47f8328b ("switchdev: add new switchdev bridge setlink").
      That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
      full use of the range. But switchdev_port_bridge_setlink was called like
      this:
      
      br_setlink
      -> br_afspec
      -> switchdev_port_bridge_setlink
      
      Basically, the switchdev and the bridge code were not tightly integrated.
      Then commit 41c498b9 ("bridge: restore br_setlink back to original")
      came, and switchdev drivers were required to implement
      .ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.
      
      In the meantime, commits such as 0944d6b5 ("bridge: try switchdev op
      first in __vlan_vid_add/del") finally made switchdev penetrate the
      br_vlan_info() barrier and start to develop the call path we have today.
      But remember, br_vlan_info() still receives VLANs one by one.
      
      Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
      29ab586c ("net: switchdev: Remove bridge bypass support from
      switchdev") so that drivers would not implement .ndo_bridge_setlink any
      longer. The switchdev_port_bridge_setlink also got deleted.
      This refactoring removed the parallel bridge_setlink implementation from
      switchdev, and left the only switchdev VLAN objects to be the ones
      offloaded from __vlan_vid_add (basically RX filtering) and __vlan_add
      (the latter coming from commit 9c86ce2c ("net: bridge: Notify about
      bridge VLANs")).
      
      That is to say, today the switchdev VLAN object ranges are not used in
      the kernel. Refactoring the above call path is a bit complicated, when
      the bridge VLAN call path is already a bit complicated.
      
      Let's go off and finish the job of commit 29ab586c by deleting the
      bogus iteration through the VLAN ranges from the drivers. Some aspects
      of this feature never made too much sense in the first place. For
      example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
      flag supposed to mean, when a port can obviously have a single pvid?
      This particular configuration _is_ denied as of commit 6623c60d
      ("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
      perspective, the driver still has to play pretend, and only offload the
      vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
      modify the flags of another, completely unrelated, switchdev VLAN
      object! (A VLAN that is PVID will invalidate the PVID flag from whatever
      other VLAN had previously been offloaded with switchdev and had that
      flag. Yet switchdev never notifies about that change; drivers are
      supposed to guess.)
      
      Nonetheless, having a VLAN range in the API makes error handling look
      scarier than it really is - unwinding on errors and all of that.
      When in reality, no one really calls this API with more than one VLAN.
      It is all unnecessary complexity.
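
The difference in driver-side error handling can be sketched roughly as
follows. This is a minimal userspace illustration, not kernel code: the
toy_* structs and functions are hypothetical stand-ins for a driver's VLAN
table and its switchdev handlers, and only the vid_begin/vid_end naming
follows the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_N_VID 4096

/* Hypothetical stand-in for a port's hardware VLAN table. */
struct toy_port {
	bool member[TOY_N_VID];
};

/* Range-style object: the driver has to walk vid_begin..vid_end. */
struct toy_vlan_range {
	uint16_t flags;
	uint16_t vid_begin;
	uint16_t vid_end;
};

/* Single-VID object: nothing to iterate over. */
struct toy_vlan {
	uint16_t flags;
	uint16_t vid;
};

/* Toy hardware accessors; only VIDs 1..4094 are accepted. */
static int toy_hw_vlan_add(struct toy_port *p, unsigned int vid)
{
	if (vid < 1 || vid > 4094)
		return -EINVAL;
	p->member[vid] = true;
	return 0;
}

static void toy_hw_vlan_del(struct toy_port *p, unsigned int vid)
{
	assert(vid < TOY_N_VID);
	p->member[vid] = false;
}

/*
 * Range variant: a failure halfway through forces an unwind loop.
 * The unwind here is naive: it also removes VIDs that were already
 * members before the call, which is part of what makes it scary.
 */
static int toy_port_vlan_add_range(struct toy_port *p,
				   const struct toy_vlan_range *v)
{
	unsigned int vid;
	int err;

	for (vid = v->vid_begin; vid <= v->vid_end; vid++) {
		err = toy_hw_vlan_add(p, vid);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	while (vid-- > v->vid_begin)
		toy_hw_vlan_del(p, vid);
	return err;
}

/* Single-VID variant: one call, one error code, nothing to unwind. */
static int toy_port_vlan_add(struct toy_port *p, const struct toy_vlan *v)
{
	return toy_hw_vlan_add(p, v->vid);
}
```

With a single VID per object, the whole unwind path disappears and the
handler reduces to forwarding one error code.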
      
      And despite appearing pretentious (two-phase transactional model and
      all), the switchdev API is really sloppy because the VLAN addition and
      removal operations are not paired with one another (you can add a VLAN
      100 times and delete it just once). The bridge notifies through
      switchdev of a VLAN addition not only when the flags of an existing VLAN
      change, but also when nothing changes. There are switchdev drivers out
      there that don't like adding a VLAN that has already been added, and
      those checks don't really belong at driver level. But the fact that the
      API contains ranges is yet another factor that prevents this from being
      addressed in the future.
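
To make the unpaired add/delete behaviour concrete, here is a small
userspace sketch of a driver-side handler that tolerates repeated
additions and guesses at the implicit PVID hand-over mentioned earlier.
All toy_* names and the flag value are hypothetical; real drivers keep
this state in their own ways.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_N_VID   4096
#define TOY_VLAN_F_PVID  0x1	/* hypothetical flag bit for this sketch */

struct toy_vlan_port {
	bool member[TOY_VLAN_N_VID];
	uint16_t pvid;		/* 0 means "no pvid" */
};

/*
 * Add is not paired with delete: the same VID may arrive many times,
 * with changed flags or with nothing changed at all, so a repeated
 * add is treated as a refresh rather than as an error.
 */
static int toy_vlan_add(struct toy_vlan_port *p, uint16_t vid, uint16_t flags)
{
	if (vid < 1 || vid > 4094)
		return -EINVAL;

	p->member[vid] = true;

	if (flags & TOY_VLAN_F_PVID)
		p->pvid = vid;	/* silently takes over from the old pvid */
	else if (p->pvid == vid)
		p->pvid = 0;	/* same VID re-added without the flag */

	/* Invariant: the pvid, if any, is always a member VLAN. */
	assert(p->pvid == 0 || p->member[p->pvid]);
	return 0;
}

/* A single delete undoes any number of earlier adds. */
static int toy_vlan_del(struct toy_vlan_port *p, uint16_t vid)
{
	if (vid < 1 || vid > 4094)
		return -EINVAL;
	if (!p->member[vid])
		return -ENOENT;

	p->member[vid] = false;
	if (p->pvid == vid)
		p->pvid = 0;
	return 0;
}
```

In this sketch, adding VLAN 100 a hundred times and deleting it once
leaves the port without VLAN 100, and adding a second PVID VLAN moves the
pvid without any notification about the first one.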
      
      Of the existing switchdev pieces of hardware, it appears that only
      Mellanox Spectrum supports offloading more than one VLAN at a time,
      through mlxsw_sp_port_vlan_set. I have kept that code internal to the
      driver, because there is some more bookkeeping that makes use of it, but
      I deleted it from the switchdev API. But since the switchdev support for
      ranges has already been de facto deleted by a Mellanox employee and
      nobody noticed for 4 years, I'm going to assume it's not a biggie.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b7a9e0da