1. 26 Jan, 2021 7 commits
  2. 24 Jan, 2021 15 commits
  3. 23 Jan, 2021 18 commits
    • J
      Merge branch 'mlxsw-expose-number-of-physical-ports' · 59a49d96
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Expose number of physical ports
      
      The switch ASIC has a limited capacity of physical ports that it can
      support. While each system is brought up with a different number of
      ports, this number can be increased via splitting up to the ASIC's
      limit.
      
      Expose physical ports as a devlink resource so that user space will have
      visibility into the maximum number of ports that can be supported and
      the current occupancy. With this resource it is possible, for example,
      to write generic (i.e., not platform dependent) tests for port
      splitting.
      
      Patch #1 adds the new resource and patch #2 adds a selftest.
      
      v2:
      * Add the physical ports resource as a generic devlink resource so that
        it could be re-used by other device drivers
      ====================
      
      Link: https://lore.kernel.org/r/20210121131024.2656154-1-idosch@idosch.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • D
      selftests: mlxsw: Add a scale test for physical ports · 5154b1b8
      Danielle Ratson authored
      Query the maximum number of supported physical ports using devlink-resource
      and test that this number can be reached by splitting each of the
      splittable ports to its width. Test that an error is returned in case
      the maximum number is exceeded.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
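The scale test described in this commit can be modeled in a few lines. This is a minimal Python sketch of the idea only, not the selftest itself: the `Asic` class, the limit of 8 ports and the port widths are hypothetical values chosen for illustration.

```python
class Asic:
    """Toy model of a switch ASIC with a capped number of physical ports."""

    def __init__(self, max_ports, port_widths):
        self.max_ports = max_ports        # size of the resource (hypothetical)
        self.ports = list(port_widths)    # one entry per port: its lane width

    @property
    def occupancy(self):
        # Current number of physical ports, as the resource would report it.
        return len(self.ports)

    def split(self, index, count):
        """Split port `index` into `count` ports; fail past the ASIC limit."""
        width = self.ports[index]
        if count < 2 or count > width or width % count:
            raise ValueError("invalid split count")
        if self.occupancy - 1 + count > self.max_ports:
            raise RuntimeError("no free physical ports")
        self.ports[index:index + 1] = [width // count] * count


asic = Asic(max_ports=8, port_widths=[4, 4, 4])
asic.split(0, 4)   # 3 -> 6 ports
asic.split(4, 2)   # 6 -> 7 ports
asic.split(6, 2)   # 7 -> 8 ports, the limit is reached

try:
    asic.split(4, 2)          # would need a 9th port
    exceeded = False
except RuntimeError:
    exceeded = True
```

The check mirrors the two assertions the commit message names: the maximum can be reached by splitting, and one more split returns an error.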
    • D
      mlxsw: Register physical ports as a devlink resource · 321f7ab0
      Danielle Ratson authored
      The switch ASIC has a limited capacity of physical ('flavour physical'
      in devlink terminology) ports that it can support. While each system is
      brought up with a different number of ports, this number can be
      increased via splitting up to the ASIC's limit.
      
      Expose physical ports as a devlink resource so that user space will have
      visibility to the maximum number of ports that can be supported and the
      current occupancy.
      
      In addition, add a "Generic Resources" section in devlink-resource
      documentation so the different drivers will be aligned by the same resource
      name when exposing to user space.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • J
      Merge branch 'htb-offload' · 35187642
      Jakub Kicinski authored
      Maxim Mikityanskiy says:
      
      ====================
      HTB offload
      
      This series adds support for HTB offload to the HTB qdisc, and adds
      usage to mlx5 driver.
      
      The previous RFCs are available at [1], [2].
      
      The feature is intended to solve the performance bottleneck caused by
      the single lock of the HTB qdisc, which prevents it from scaling well.
      The HTB algorithm itself is offloaded to the device, eliminating the
      need to take the root lock of HTB on every packet. The classification
      part is done in clsact (still in software) to avoid acquiring the lock,
      which imposes the limitation that filters can target only leaf classes.
      
      The speedup on Mellanox ConnectX-6 Dx was 14.2 times in the UDP
      multi-stream test, compared to software HTB implementation (more details
      in the mlx5 patch).
      
      [1]: https://www.spinics.net/lists/netdev/msg628422.html
      [2]: https://www.spinics.net/lists/netdev/msg663548.html
      
      v2 changes:
      
      Fixed sparse and smatch warnings. Formatted HTB patches to 80 chars per
      line.
      
      v3 changes:
      
      Fixed the CI failure on parisc with 16-bit xchg by replacing it with
      WRITE_ONCE. Fixed the capability bits in mlx5_ifc.h and the value of
      MLX5E_QOS_MAX_LEAF_NODES.
      
      v4 changes:
      
      Check if HTB is root when offloading. Add extack for hardware errors.
      Rephrase explanations of how it works in the commit message. Remove %hu
      from format strings. Add resiliency when leaf_del_last fails to create a
      new leaf node.
      ====================
      
      Link: https://lore.kernel.org/r/20210119120815.463334-1-maximmi@mellanox.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      net/mlx5e: Support HTB offload · 214baf22
      Maxim Mikityanskiy authored
      This commit adds support for HTB offload in the mlx5e driver.
      
      Performance:
      
        NIC: Mellanox ConnectX-6 Dx
        CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)
      
        100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
        48 traffic classes, flower used for steering
        No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
        throughput.
      
        Baseline: 98.7 Gbps, 8.25 Mpps
        HTB: 6.7 Gbps, 0.56 Mpps
        HTB offload: 95.6 Gbps, 8.00 Mpps
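
The 14.2x figure quoted in the series cover letter can be reproduced from the three results above. A small Python sketch, using only the numbers listed in this message (the dictionary keys are illustrative names):

```python
# Throughput figures from the commit message: (Gbit/s, Mpps).
results = {
    "baseline": (98.7, 8.25),
    "htb": (6.7, 0.56),
    "htb_offload": (95.6, 8.00),
}

# Speedup of offloaded HTB over software HTB, by bandwidth and by packet rate.
gbps_speedup = results["htb_offload"][0] / results["htb"][0]
mpps_speedup = results["htb_offload"][1] / results["htb"][1]

# How close the offloaded setup gets to the unshaped baseline.
baseline_share = results["htb_offload"][0] / results["baseline"][0]

print(f"speedup over software HTB: {gbps_speedup:.2f}x (Gbit/s), "
      f"{mpps_speedup:.2f}x (Mpps)")
print(f"offload reaches {baseline_share:.1%} of the baseline throughput")
```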
      
      Limitations:
      
      1. 256 leaf nodes, 3 levels of depth.
      
      2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
      the bandwidth is split among the siblings according to these weights.
      Other parameters for classes are not supported.
      
      Ethtool statistics support for QoS SQs is also added. The counters are
      called qos_txN_*, where N is the QoS queue number (starting from 0, the
      numeration is separate from the normal SQs), and * is the counter name
      (the counters are the same as for the normal SQs).
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      sch_htb: Stats for offloaded HTB · 83271586
      Maxim Mikityanskiy authored
      This commit adds support for statistics of offloaded HTB. Byte and
      packet counters for leaf and inner nodes are supported. The values are
      taken from per-queue qdiscs, and the numbers that the user sees should
      behave the same as with the software (non-offloaded) HTB.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      sch_htb: Hierarchical QoS hardware offload · d03b195b
      Maxim Mikityanskiy authored
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
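
The data path above can be sketched as a lookup from class ID to hardware queue. This is a Python model under stated assumptions, not driver code: `leaf_to_queue`, `select_queue` and the fallback to queue 0 are hypothetical. The only fixed fact used is the tc handle layout, where both halves of `major:minor` are hexadecimal and the major number sits in the upper 16 bits.

```python
def parse_classid(text):
    """Convert a tc handle such as "1:10" into its 32-bit value."""
    major, minor = text.split(":")
    return (int(major, 16) << 16) | int(minor or "0", 16)


# Hypothetical table a driver could keep: leaf class ID -> hardware queue.
leaf_to_queue = {
    parse_classid("1:10"): 0,
    parse_classid("1:20"): 1,
}


def select_queue(skb_priority, default_queue=0):
    """Pick the TX queue for a packet whose priority was set by clsact.

    Falling back to `default_queue` for unknown classes is an assumption
    of this sketch.
    """
    return leaf_to_queue.get(skb_priority, default_queue)


queue = select_queue(parse_classid("1:20"))
```

Because the lookup needs only the value already stored in the packet by the classifier, no qdisc-wide lock is involved at this step.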
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
            classid 1:10
      
      It's possible to move classification to clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
            action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and requires
      no changes to the kernel or iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue: it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
      
      Some features of HTB may not be supported by particular hardware: for
      example, the maximum number of classes may be limited, or the
      granularity of the rate and ceil parameters may differ. The offload is
      therefore not enabled by default; a new parameter enables it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      net: sched: Add extack to Qdisc_class_ops.delete · 4dd78a73
      Maxim Mikityanskiy authored
      In a following commit, sch_htb will start using extack in the delete
      class operation to pass hardware errors in offload mode. This commit
      prepares for that by adding the extack parameter to this callback and
      converting usage of the existing qdiscs.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      net: sched: Add multi-queue support to sch_tree_lock · ca1e4ab1
      Maxim Mikityanskiy authored
      The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
      However, hardware-offloaded HTB will start setting this flag while also
      using sch_tree_lock.
      
      The current implementation of sch_tree_lock basically locks on
      qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
      some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
      is the root itself, and its dev_queue just points to queue 0, while not
      actually being used, because there are real per-queue qdiscs.
      
      This patch changes the logic of sch_tree_lock and sch_tree_unlock to
      lock the qdisc itself if it's the MQROOT.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
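The lock selection this commit describes can be illustrated with a small userspace model. This is a Python sketch, not the kernel code: the `Qdisc` class, its `mqroot` boolean (standing in for the TCQ_F_MQROOT flag) and `queue_root` (standing in for `dev_queue->qdisc`) are simplifications introduced here.

```python
import threading
from contextlib import contextmanager


class Qdisc:
    def __init__(self, name, mqroot=False):
        self.name = name
        self.mqroot = mqroot          # stands in for the TCQ_F_MQROOT flag
        self.lock = threading.Lock()
        self.queue_root = self        # stands in for dev_queue->qdisc


def tree_lock_target(q):
    """Return the qdisc whose lock protects the tree that q belongs to."""
    # An MQROOT qdisc is the root itself, so it must lock itself rather
    # than whatever per-queue qdisc its dev_queue happens to point at.
    return q if q.mqroot else q.queue_root


@contextmanager
def sch_tree_locked(q):
    target = tree_lock_target(q)
    with target.lock:
        yield target


per_queue = Qdisc("queue 0 qdisc")
htb = Qdisc("offloaded htb", mqroot=True)
htb.queue_root = per_queue            # dev_queue just points to queue 0

with sch_tree_locked(htb) as locked:
    locked_name = locked.name
```

Without the `mqroot` branch, the model would take the lock of the queue 0 qdisc, which is the mismatch the commit message points out.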
    • J
      Merge branch 'tcp-add-cmsg-rx-timestamps-to-rx-zerocopy' · 04a88637
      Jakub Kicinski authored
      Arjun Roy says:
      
      ====================
      tcp: add CMSG+rx timestamps to rx. zerocopy
      
      Provide CMSG and receive timestamp support to TCP
      receive zerocopy. Patch 1 refactors CMSG pending state for
      tcp_recvmsg() to avoid the use of magic numbers; patch 2 implements
      receive timestamp via CMSG support for receive zerocopy, and uses the
      constants added in patch 1.
      ====================
      
      Link: https://lore.kernel.org/r/20210121004148.2340206-1-arjunroy.kdev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • A
      tcp: Add receive timestamp support for receive zerocopy. · 7eeba170
      Arjun Roy authored
      tcp_recvmsg() uses the CMSG mechanism to receive control information
      like packet receive timestamps. This patch adds CMSG fields to
      struct tcp_zerocopy_receive, and provides receive timestamps
      if available to the user.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • A
      tcp: Remove CMSG magic numbers for tcp_recvmsg(). · 925bba24
      Arjun Roy authored
      At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
      and what those CMSGs are. These flags are currently magic numbers,
      used only within tcp_recvmsg().
      
      To prepare for receive timestamp support in tcp receive zerocopy,
      gently refactor these magic numbers into enums.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
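The refactoring pattern this commit describes, replacing bare numeric flags with named constants, can be shown in miniature. This is a Python sketch of the pattern only; the `TcpCmsg` names and values and the `pending_cmsgs` helper are illustrative and are not taken from the patch.

```python
from enum import IntFlag


class TcpCmsg(IntFlag):
    """Pending-CMSG flags; names and values are illustrative."""
    INQ = 1   # an in-queue byte count should be reported
    TS = 2    # a receive timestamp should be reported


def pending_cmsgs(want_inq, has_timestamp):
    """Build the pending set with named flags instead of literals 1 and 2."""
    flags = TcpCmsg(0)
    if want_inq:
        flags |= TcpCmsg.INQ
    if has_timestamp:
        flags |= TcpCmsg.TS
    return flags


flags = pending_cmsgs(want_inq=False, has_timestamp=True)
needs_timestamp = TcpCmsg.TS in flags
```

Named flags keep the numeric encoding unchanged, so a second caller (here, the zerocopy receive path) can share the same constants.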
    • J
      Merge branch 'net-bridge-multicast-add-initial-eht-support' · 5225d5f5
      Jakub Kicinski authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: multicast: add initial EHT support
      
      This set adds explicit host tracking (EHT) support for IGMPv3/MLDv2. The
      already present per-port fast leave flag is used to enable it, since the
      primary goal of EHT is to track the usage of a group and its S,Gs per
      host and, once no interested hosts remain, to delete them before the
      standard timers expire. The EHT code is pretty self-contained and not
      enabled by default.
      There is no new uAPI added, all of the functionality is currently hidden
      behind the fast leave flag. In the future that will change (more below).
      The host tracking uses two new sets per port group: one having an entry for
      each host which contains that host's view of the group (source list and
      filter mode), and one set which contains an entry for each source having
      an internal set which contains an entry for each host that has reported
      an interest for that source. RB trees are used for all sets so they're
      compact when not used and fast when we need to do lookups.
      To illustrate it:
       [ bridge port group ]
        ` [ host set (rb) ]
         ` [ host entry with a list of sources and filter mode ]
        ` [ source set (rb) ]
         ` [ source entry ]
          ` [ source host set (rb) ]
           ` [ source host entry with a timer ]
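
The two sets in the diagram can be modeled as follows. This is a minimal Python sketch under stated assumptions: plain dicts and sets replace the RB trees, timers are omitted, and only include-mode behavior is modeled (a host with no sources is dropped). The `PortGroup` class and its method names are hypothetical; `PG_SRC_ENT_LIMIT` and its value come from the text below.

```python
PG_SRC_ENT_LIMIT = 32   # per-port-group S,G limit quoted in the cover letter


class PortGroup:
    def __init__(self):
        self.hosts = {}     # host -> {"mode": str, "sources": set}
        self.sources = {}   # source -> set of hosts interested in it

    def add(self, host, source, mode="include"):
        """Record that `host` reported interest in `source`."""
        entry = self.hosts.setdefault(host, {"mode": mode, "sources": set()})
        known = source in entry["sources"]
        if not known and len(entry["sources"]) >= PG_SRC_ENT_LIMIT:
            return False
        entry["sources"].add(source)
        self.sources.setdefault(source, set()).add(host)
        return True

    def remove(self, host, source):
        """Drop one host's interest; empty entries are deleted on the way up.

        Returns True when no interested hosts remain, i.e. when the group
        could be deleted without waiting for the standard timers.
        """
        interested = self.sources.get(source, set())
        interested.discard(host)
        if source in self.sources and not interested:
            del self.sources[source]
        entry = self.hosts.get(host)
        if entry:
            entry["sources"].discard(source)
            if not entry["sources"]:
                del self.hosts[host]
        return not self.hosts


pg = PortGroup()
pg.add("host A", "src X")
pg.add("host B", "src X")
still_used = not pg.remove("host A", "src X")   # host B is still interested
can_delete = pg.remove("host B", "src X")       # last host left
```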
      
      The number of tracked sources per host is limited to the maximum total
      number of S,G entries per port group - PG_SRC_ENT_LIMIT (currently 32).
      The number of hosts is unlimited. I think the argument that a local
      attacker can exhaust the memory/cause high CPU usage applies equally to
      fdb entries, which are also unlimited. In the future, if needed, we can
      add an option to limit these, but I don't think it's necessary for a
      start. All of the new sets are protected by the bridge's multicast lock.
      I'm pretty sure we'll be changing the cases and improving the
      convergence time in the future, but this seems like a good start.
      
      Patch breakdown:
       patch 1 -  4: minor cleanups and preparations for EHT
       patch      5: adds the new structures which will be used in the
                     following patches
       patch      6: adds support to create, destroy and lookup host entries
       patch      7: adds support to create, delete and lookup source set entries
       patch      8: adds a host "delete" function which is just a host's
                     source list flush since that would automatically delete
                     the host
       patch 9 - 10: add support for handling all IGMPv3/MLDv2 report types;
                     more information can be found in the individual patches
       patch     11: optimizes a specific TO_INCLUDE use-case with host timeouts
       patch     12: handles per-host filter mode changing (include <-> exclude)
       patch     13: pulls out block group deletion since now it can be
                     deleted in both filter modes
       patch     14: marks deletions done due to fast leave
      
      Future plans:
       - export host information
       - add an option to reduce queries
       - add an option to limit the number of host entries
       - tune more fast leave cases for quicker convergence
      
      By the way I think this is the first open-source EHT implementation, I
      couldn't find any while researching it. :)
      ====================
      
      Link: https://lore.kernel.org/r/20210120145203.1109140-1-razor@blackwall.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • N
      net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletes · d5a10222
      Nikolay Aleksandrov authored
      Mark groups which were deleted due to fast leave/EHT.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • N
      net: bridge: multicast: handle block pg delete for all cases · e87e4b5c
      Nikolay Aleksandrov authored
      A block report can result in empty source and host sets for both include
      and exclude groups so if there are no hosts left we can safely remove
      the group. Pull the block group handling so it can cover both cases and
      add a check if EHT requires the delete.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • N
      net: bridge: multicast: add EHT host filter_mode handling · c9739016
      Nikolay Aleksandrov authored
      We should be able to handle a host changing its filter mode. For exclude
      mode we must create a zero-src entry so the group is kept even without
      any S,G entries (non-zero source sets). That entry doesn't count toward
      the entry limit and can always be created. Its timer is refreshed on new
      exclude reports; if the host filter mode changes to include, the entry
      is removed and we rely only on the non-zero source sets.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • N
      net: bridge: multicast: optimize TO_INCLUDE EHT timeouts · b66bf55b
      Nikolay Aleksandrov authored
      This is an optimization specifically for TO_INCLUDE which sends queries
      for the older entries and thus lowers the S,G timers to LMQT. If we have
      the following situation for a group in either include or exclude mode:
       - host A was interested in srcs X and Y, but is timing out
       - host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
         to LMQT
       - host B sends BLOCK src Z after LMQT time has passed
       => since host B is the last host we can delete the group, but if we
          still have host A's EHT entries for X and Y (i.e. if they weren't
          lowered to LMQT previously) then we'll have to wait another LMQT
          time before deleting the group, with this optimization we can
          directly remove it regardless of the group mode as there are no more
          interested hosts
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • N
      net: bridge: multicast: add EHT include and exclude handling · ddc255d9
      Nikolay Aleksandrov authored
      Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to
      how the reports are processed, there are 2 cases, depending on whether
      the group is in include or exclude mode. They are processed as follows:
       - group include
        - is_include: create missing entries
        - to_include: flush existing entries and create a new set from the
          report, obviously if the src set is empty then we delete the group
      
       - group exclude
        - is_exclude: create missing entries
        - to_exclude: flush existing entries and create a new set from the
          report, any empty source set entries are removed
      
      If the group is in a different mode then we just flush all entries reported
      by the host and create a new set with the new mode entries created from
      the report. If the report is of include type, its source list is empty and
      the group has an empty source set, then we remove the group. Any source set
      entries which are empty are removed as well. A group in exclude mode
      can exist without any S,G entries (allowing all traffic to pass).
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>