1. 20 Apr, 2022 16 commits
    • H
      net: hns3: add log for setting tx spare buf size · 2373b35c
      Hao Chen committed
      The active tx spare buffer size may be changed according to the
      page size, so add a log to report it.
      Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2373b35c
    • J
      net: hns3: add failure logs in hclge_set_vport_mtu · bcc7a98f
      Jie Wang committed
      Currently, there is a low probability that PF MTU configuration fails, but
      the information in the logs is insufficient for locating the problem when
      the VF MTU value is illegally modified.
      
      So record the VF index and VF MTU value in the failure scenario.
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcc7a98f
    • J
      net: hns3: refine the definition for struct hclge_pf_to_vf_msg · 6fde96df
      Jian Shen committed
      The struct hclge_pf_to_vf_msg is used for mailbox messages from
      PF to VF, including both responses and requests. But its definition
      can only describe a response, which makes the message data copy in
      function hclge_send_mbx_msg() unreadable. So refine it by adding
      a general message definition to it.
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fde96df
    • H
      net: hns3: refactor hns3_set_ringparam() · 07fdc163
      Hao Chen committed
      Use struct hns3_ring_param to replace the new/old_xxx variables, and
      add hns3_is_ringparam_changed() to check whether they have changed,
      to improve code readability.
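
A minimal sketch of the refactoring idea, assuming the ring parameters are three integer fields. The struct layout, field names, and the sketch_ prefix are illustrative assumptions, not the driver's actual definitions: grouping the values in one struct lets a single helper compare the old and new settings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hns3_ring_param; the fields are assumed. */
struct sketch_ring_param {
	uint32_t tx_desc_num;
	uint32_t rx_desc_num;
	uint32_t rx_buf_len;
};

/* Returns true when any ring setting differs between old and new. */
static bool sketch_is_ringparam_changed(const struct sketch_ring_param *old_p,
					const struct sketch_ring_param *new_p)
{
	return old_p->tx_desc_num != new_p->tx_desc_num ||
	       old_p->rx_desc_num != new_p->rx_desc_num ||
	       old_p->rx_buf_len != new_p->rx_buf_len;
}
```

A caller can then return early when nothing changed instead of comparing several new/old variable pairs inline.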
      Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07fdc163
    • Y
      net: hns3: add ethtool parameter check for CQE/EQE mode · 286c61e7
      Yufeng Mo committed
      For DEVICE_VERSION_V2, the hardware does not support CQE mode.
      So add a capability bit for coalesce CQE mode and add a parameter check
      for it in ethtool.
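
The check described above can be sketched roughly as follows. The capability bit name, its position, and the -EOPNOTSUPP return value are assumptions made for illustration; the commit message only states that a capability bit and a parameter check are added.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical capability bit; the real driver's bit name and value may differ. */
#define SKETCH_CAP_CQE_MODE (1u << 0)

/*
 * Reject a request for CQE mode when the device does not advertise it.
 * Returns 0 when the request is acceptable, a negative errno otherwise
 * (the specific errno is an assumption).
 */
static int sketch_check_cqe_mode(unsigned int caps, bool want_cqe_mode)
{
	if (want_cqe_mode && !(caps & SKETCH_CAP_CQE_MODE))
		return -EOPNOTSUPP;

	return 0;
}
```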
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      286c61e7
    • D
      Merge branch 'atlantic-xdp-multi-buffer' · e97e917b
      David S. Miller committed
      [PATCH net-next v5 0/3] net: atlantic: Add XDP support
      From: Taehee Yoo @ 2022-04-17 10:12 UTC
        To: davem, kuba, pabeni, netdev, irusskikh, ast, daniel, hawk,
      	john.fastabend, andrii, kafai, songliubraving, yhs, kpsingh, bpf
        Cc: ap420073
      
      This patchset makes atlantic support multi-buffer XDP.
      
      The first patch implements the control plane of XDP.
      aq_xdp(), the callback of .xdp_bpf, is added.
      
      The second patch implements the data plane of XDP.
      XDP_TX, XDP_DROP, and XDP_PASS are supported.
      __aq_ring_xdp_clean() is added to receive packets and execute the XDP program.
      aq_nic_xmit_xdpf() is added to send packets by XDP.
      
      The third patch implements the callback of .ndo_xdp_xmit.
      aq_xdp_xmit() is added to send redirected packets and it internally
      calls aq_nic_xmit_xdpf().
      
      Memory model is MEM_TYPE_PAGE_SHARED.
      
      Order-2 page allocation is used when XDP is enabled.
      
      LRO will be disabled if the XDP program doesn't support multi buffer.
      
      AQC chip supports 32 multi-queues and 8 vectors(irq).
      There are two options.
      1. under 8 cores and maximum 4 tx queues per core.
      2. under 4 cores and maximum 8 tx queues per core.
      
      As in other drivers, these tx queues could be dedicated to XDP_TX and
      XDP_REDIRECT. If so, no tx_lock would be needed.
      But this patchset doesn't use this strategy because the cost of getting
      the hardware tx queue index is too high.
      So, tx_lock is used in aq_nic_xmit_xdpf().
      
      single-core, single queue, 80% cpu utilization.
      
        32.30%  [kernel]                  [k] aq_get_rxpages_xdp
        10.44%  [kernel]                  [k] aq_hw_read_reg <---------- here
         9.86%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
         5.51%  [kernel]                  [k] aq_ring_rx_clean
      
      single-core, 8 queues, 100% cpu utilization, half PPS.
      
        52.03%  [kernel]                  [k] aq_hw_read_reg <---------- here
        18.24%  [kernel]                  [k] aq_get_rxpages_xdp
         4.30%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
         4.24%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
         2.79%  [kernel]                  [k] aq_ring_rx_clean
      
      Performance result(64 Byte)
      1. XDP_TX
        a. xdp_generic, single core
          - 2.5Mpps, 100% cpu
        b. xdp_driver, single core
          - 4.5Mpps, 80% cpu
        c. xdp_generic, 8 core(hyper thread)
          - 6.3Mpps, 40% cpu
        d. xdp_driver, 8 core(hyper thread)
          - 6.3Mpps, 30% cpu
      
      2. XDP_REDIRECT
        a. xdp_generic, single core
          - 2.3Mpps
        b. xdp_driver, single core
          - 4.5Mpps
      
      v5:
       - Use MEM_TYPE_PAGE_SHARED instead of MEM_TYPE_PAGE_ORDER0
       - Use 2K frame size instead of 3K
       - Use order-2 page allocation instead of order-0
       - Rename aq_get_rxpage() to aq_alloc_rxpages()
       - Add missing PageFree stats for ethtool
       - Remove aq_unset_rxpage_xdp(), introduced by v2 patch due to
         change of memory model
       - Fix wrong last parameter value of xdp_prepare_buff()
       - Add aq_get_rxpages_xdp() to increase page reference count
      
      v4:
       - Fix compile warning
      
      v3:
       - Change wrong PPS performance result 40% -> 80% in single
         core(Intel i3-12100)
       - Separate aq_nic_map_xdp() from aq_nic_map_skb()
       - Drop multi buffer packets if single buffer XDP is attached
       - Disable LRO when single buffer XDP is attached
       - Use xdp_get_{frame/buff}_len()
      
      v2:
       - Do not use inline in C file
      
      Taehee Yoo (3):
        net: atlantic: Implement xdp control plane
        net: atlantic: Implement xdp data plane
        net: atlantic: Implement .ndo_xdp_xmit handler
      
       .../net/ethernet/aquantia/atlantic/aq_cfg.h   |   1 +
       .../ethernet/aquantia/atlantic/aq_ethtool.c   |   9 +
       .../net/ethernet/aquantia/atlantic/aq_main.c  |  87 ++++
       .../net/ethernet/aquantia/atlantic/aq_main.h  |   2 +
       .../net/ethernet/aquantia/atlantic/aq_nic.c   | 136 ++++++
       .../net/ethernet/aquantia/atlantic/aq_nic.h   |   5 +
       .../net/ethernet/aquantia/atlantic/aq_ring.c  | 409 ++++++++++++++++--
       .../net/ethernet/aquantia/atlantic/aq_ring.h  |  21 +-
       .../net/ethernet/aquantia/atlantic/aq_vec.c   |  23 +-
       .../net/ethernet/aquantia/atlantic/aq_vec.h   |   6 +
       .../aquantia/atlantic/hw_atl/hw_atl_a0.c      |   6 +-
       .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  10 +-
       12 files changed, 670 insertions(+), 45 deletions(-)
      
      --
      2.17.1
      
      * [PATCH net-next v5 1/3] net: atlantic: Implement xdp control plane
      From: Taehee Yoo @ 2022-04-17 10:12 UTC
      
      aq_xdp() is an XDP setup callback function for the Atlantic driver.
      When XDP is attached or detached, the device will be restarted because
      it uses different headroom, tailroom, and page order values.
      
      If XDP is enabled, the default page order value switches from 0 to 2,
      because the default maximum frame size is still 2K and additional
      area is needed for headroom and tailroom.
      The total size (headroom + frame size + tailroom) is 2624 bytes.
      So, with an order-0 page, 1472 bytes would always be wasted for every frame.
      But when order-2 is used, these pages can be used 6 times
      with the flip strategy.
      It means only about 106 bytes per frame will be wasted.
      
      It also supports the XDP fragment feature.
      MTU can be 16K if the XDP program supports XDP fragments.
      If not, MTU cannot exceed 2K - ETH_HLEN - ETH_FCS.
      
      A static key is also added; it will be used to call the xdp_clean
      handler in ->poll(). The data plane implementation is contained in
      the following patch.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      ---
      
      v5:
       - Use MEM_TYPE_PAGE_SHARED instead of MEM_TYPE_PAGE_ORDER0
       - Use 2K frame size instead of 3K
       - Use order-2 page allocation instead of order-0
       - Rename aq_get_rxpage() to aq_alloc_rxpages()
      
      v4:
       - No change
      
      v3:
       - Disable LRO when single buffer XDP is attached
      
      v2:
       - No change
      e97e917b
    • T
      net: atlantic: Implement .ndo_xdp_xmit handler · 45638f01
      Taehee Yoo committed
      aq_xdp_xmit() is the callback function of .ndo_xdp_xmit.
      It internally calls aq_nic_xmit_xdpf() to send packets.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45638f01
    • T
      net: atlantic: Implement xdp data plane · 26efaef7
      Taehee Yoo committed
      It supports XDP_PASS, XDP_DROP and multi buffer.
      
      The new function aq_nic_xmit_xdpf() is used to send a packet with
      an xdp_frame, and internally it calls aq_nic_map_xdp().
      
      AQC chip supports 32 multi-queues and 8 vectors(irq).
      There are two options:
      1. under 8 cores and 4 tx queues per core.
      2. under 4 cores and 8 tx queues per core.
      
      As in ixgbe, these tx queues could be dedicated to XDP_TX and
      XDP_REDIRECT. If so, no tx_lock would be needed.
      But this patchset doesn't use this strategy because the cost of getting
      the hardware tx queue index is too high.
      So, tx_lock is used in aq_nic_xmit_xdpf().
      
      single-core, single queue, 80% cpu utilization.
      
        30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
        10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
         4.38%  [kernel]                  [k] get_page_from_freelist
      
      single-core, 8 queues, 100% cpu utilization, half PPS.
      
        45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
        17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
         4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
      
      The new function __aq_ring_xdp_clean() is an XDP rx handler and is
      called only when XDP is attached.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26efaef7
    • T
      net: atlantic: Implement xdp control plane · 0d14657f
      Taehee Yoo committed
      aq_xdp() is an XDP setup callback function for the Atlantic driver.
      When XDP is attached or detached, the device will be restarted because
      it uses different headroom, tailroom, and page order values.
      
      If XDP is enabled, the default page order value switches from 0 to 2,
      because the default maximum frame size is still 2K and additional
      area is needed for headroom and tailroom.
      The total size (headroom + frame size + tailroom) is 2624 bytes.
      So, with an order-0 page, 1472 bytes would always be wasted for every frame.
      But when order-2 is used, these pages can be used 6 times
      with the flip strategy.
      It means only about 106 bytes per frame will be wasted.
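
The waste figures above can be reproduced with a little arithmetic. The sketch below assumes 4 KiB base pages (not stated in the commit message); the 2624-byte total and the page orders come from the text, while the helper names are made up for illustration.

```c
/* Assumption: 4 KiB base pages. The 2624-byte total is taken from the text. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_FRAME_TOTAL 2624u	/* headroom + 2K frame + tailroom */

/* Bytes in one page allocation of the given order. */
static unsigned int sketch_alloc_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Number of whole frames that fit in one allocation of this order. */
static unsigned int sketch_frames_per_alloc(unsigned int order)
{
	return sketch_alloc_bytes(order) / SKETCH_FRAME_TOTAL;
}

/* Average unused bytes per frame, rounded down. */
static unsigned int sketch_waste_per_frame(unsigned int order)
{
	unsigned int frames = sketch_frames_per_alloc(order);
	unsigned int used = frames * SKETCH_FRAME_TOTAL;

	return (sketch_alloc_bytes(order) - used) / frames;
}
```

With order 0 a 4096-byte page holds one frame and leaves 4096 - 2624 = 1472 bytes unused. With order 2 a 16384-byte allocation holds 6 frames (15744 bytes), leaving 640 bytes, or about 106 bytes per frame.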
      
      It also supports the XDP fragment feature.
      MTU can be 16K if the XDP program supports XDP fragments.
      If not, MTU cannot exceed 2K - ETH_HLEN - ETH_FCS.
      
      A static key is also added; it will be used to call the xdp_clean
      handler in ->poll(). The data plane implementation is contained in
      the following patch.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d14657f
    • D
      Merge branch 'dsa-cross-chip-notifier-cleanup' · 8ab38ed7
      David S. Miller committed
      Vladimir Oltean says:
      
      ====================
      DSA cross-chip notifier cleanups
      
      This patch set makes the following improvements:
      
      - Cross-chip notifiers pass a switch index, port index, sometimes tree
        index, all as integers. Sometimes we need to recover the struct
        dsa_port based on those integers. That recovery involves traversing a
        list. By passing directly a pointer to the struct dsa_port we can
        avoid that, and the indices passed previously can still be obtained
        from the passed struct dsa_port.
      
      - Resetting VLAN filtering on a switch has explicit code to make it run
        on a single switch, so it has no place to stay in the cross-chip
        notifier code. Move it out.
      
      - Changing the MTU on a user port affects only that single port, yet the
        code passes through the cross-chip notifier layer where all switches
        are notified. Avoid that.
      
      - Other related cosmetic changes in the MTU changing procedure.
      
      Apart from the slight improvement in performance given by
      (a) doing less work in cross-chip notifiers
      (b) emitting fewer cross-chip notifiers
      we also end up with about 100 fewer lines of code.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ab38ed7
    • V
      net: dsa: don't emit targeted cross-chip notifiers for MTU change · be6ff966
      Vladimir Oltean committed
      A cross-chip notifier with "targeted_match=true" is one that matches
      only the local port of the switch that emitted it. In other words,
      passing through the cross-chip notifier layer serves no purpose.
      
      Eliminate this concept by calling directly ds->ops->port_change_mtu
      instead of emitting a targeted cross-chip notifier. This leaves the
      DSA_NOTIFIER_MTU event being emitted only for MTU updates on the CPU
      port, which need to be reflected also across all DSA links.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be6ff966
    • V
      net: dsa: drop dsa_slave_priv from dsa_slave_change_mtu · 4715029f
      Vladimir Oltean committed
      We can get hold of the "ds" pointer directly from "dp", no need for
      the dsa_slave_priv.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4715029f
    • V
      net: dsa: avoid one dsa_to_port() in dsa_slave_change_mtu · cf1c39d3
      Vladimir Oltean committed
      We could retrieve the cpu_dp pointer directly from the "dp" we already
      have, no need to resort to dsa_to_port(ds, port).
      
      This change also removes the need for an "int port", so that is also
      deleted.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf1c39d3
    • V
      net: dsa: use dsa_tree_for_each_user_port in dsa_slave_change_mtu · b2033a05
      Vladimir Oltean committed
      Use the more conventional iterator over user ports instead of explicitly
      ignoring them, and use the more conventional name "other_dp" instead of
      "dp_iter", for readability.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2033a05
    • V
      net: dsa: make cross-chip notifiers more efficient for host events · 726816a1
      Vladimir Oltean committed
      To determine whether a given port should react to the port targeted by
      the notifier, dsa_port_host_vlan_match() and dsa_port_host_address_match()
      look at the positioning of the switch port currently executing the
      notifier relative to the switch port for which the notifier was emitted.
      
      To maintain stylistic compatibility with the other match functions from
      switch.c, the host address and host VLAN match functions take the
      notifier information about targeted port, switch and tree indices as
      argument. However, these functions only use that information to retrieve
      the struct dsa_port *targeted_dp, which is an invariant for the outer
      loop that calls them. So it makes more sense to calculate the targeted
      dp only once, and pass it to them as argument.
      
      But furthermore, the targeted dp is actually known at the time the call
      to dsa_port_notify() is made. It is just that we decide to only save the
      indices of the port, switch and tree in the notifier structure, just to
      retrace our steps and find the dp again using dsa_switch_find() and
      dsa_to_port().
      
      But both the above functions are relatively expensive, since they need
      to iterate through lists. It appears more straightforward to make all
      notifiers just pass the targeted dp inside their info structure, and
      have the code that needs the indices to look at info->dp->index instead
      of info->port, or info->dp->ds->index instead of info->sw_index, or
      info->dp->ds->dst->index instead of info->tree_index.
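
The pointer-chasing described above can be sketched with trimmed-down stand-in types. The sketch_ structs below are hypothetical and keep only the fields named in the text; they are not the kernel's struct dsa_port, struct dsa_switch, or struct dsa_switch_tree.

```c
/* Hypothetical, minimal stand-ins for the DSA tree, switch, and port. */
struct sketch_tree {
	unsigned int index;
};

struct sketch_switch {
	unsigned int index;
	struct sketch_tree *dst;
};

struct sketch_port {
	unsigned int index;
	struct sketch_switch *ds;
};

/* Notifier payload carrying the port pointer instead of three integers. */
struct sketch_notifier_info {
	const struct sketch_port *dp;
};

/* Replaces info->port. */
static unsigned int sketch_info_port(const struct sketch_notifier_info *info)
{
	return info->dp->index;
}

/* Replaces info->sw_index. */
static unsigned int sketch_info_sw_index(const struct sketch_notifier_info *info)
{
	return info->dp->ds->index;
}

/* Replaces info->tree_index. */
static unsigned int sketch_info_tree_index(const struct sketch_notifier_info *info)
{
	return info->dp->ds->dst->index;
}
```

Each index is still reachable in constant time from the pointer, whereas recovering the pointer from the indices requires walking lists.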
      
      For the sake of consistency, all cross-chip notifiers are converted to
      pass the "dp" directly.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      726816a1
    • V
      net: dsa: move reset of VLAN filtering to dsa_port_switchdev_unsync_attrs · 8e9e678e
      Vladimir Oltean committed
      In dsa_port_switchdev_unsync_attrs() there is a comment that resetting
      the VLAN filtering isn't done where it is expected. And since commit
      108dc874 ("net: dsa: Avoid cross-chip syncing of VLAN filtering"),
      there is no reason to handle this in switch.c either.
      
      Therefore, move the logic to port.c, and adapt it slightly to the data
      structures and naming conventions from there.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e9e678e
  2. 19 Apr, 2022 8 commits
    • P
      Merge branch 'rtnetlink-improve-alt_ifname-config-and-fix-dangerous-group-usage' · cc4bdef2
      Paolo Abeni committed
      Florent Fourcot says:
      
      ====================
      rtnetlink: improve ALT_IFNAME config and fix dangerous GROUP usage
      
      The first commit forbids dangerous calls when both IFNAME and GROUP are
      given, since they can introduce unexpected behaviour when IFNAME does not
      match any interface.
      
      The second patch achieves the primary goal of this patchset, which is to
      fix/improve the IFLA_ALT_IFNAME attribute, since the previous code never
      worked for newlink/setlink. The ip-link command probably gets the
      interface index beforehand, and was not using this feature.
      
      The last two patches improve the error codes in corner cases.
      
      Changes in v2:
        * Remove ifname argument in rtnl_dev_get/do_setlink
          functions (simplify code)
        * Use a boolean to avoid condition duplication in __rtnl_newlink
      
      Changes in v3:
        * Simplify rtnl_dev_get signature
      
      Changes in v4:
        * Rename link_lookup to link_specified
      
      Changes in v5:
        * Re-order patches
      ====================
      
      Link: https://lore.kernel.org/r/20220415165330.10497-1-florent.fourcot@wifirst.fr
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      cc4bdef2
    • F
      rtnetlink: return EINVAL when request cannot succeed · b6177d32
      Florent Fourcot committed
      A request without an interface name, interface index, or interface group
      cannot work. We should return EINVAL.
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b6177d32
    • F
      rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink · dee04163
      Florent Fourcot committed
      If IFLA_ALT_IFNAME is set and the given interface is not found,
      we should return ENODEV and be consistent with IFLA_IFNAME
      behaviour.
      This commit extends the feature of commit 76c9ac0e,
      "net: rtnetlink: add possibility to use alternative names as message handle"
      
      CC: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      dee04163
    • F
      rtnetlink: enable alt_ifname for setlink/newlink · 5ea08b52
      Florent Fourcot committed
      The buffer called "ifname" given to rtnl_dev_get()
      is always valid when called by setlink/newlink,
      but contains only an empty string when IFLA_IFNAME is not given. So
      IFLA_ALT_IFNAME is always ignored.
      
      This patch fixes rtnl_dev_get() by removing the ifname argument
      and moving the ifname copy into do_setlink() when required.
      
      It extends the feature of commit 76c9ac0e,
      "net: rtnetlink: add possibility to use alternative names as message
      handle"
      
      CC: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      5ea08b52
    • F
      rtnetlink: return ENODEV when ifname does not exist and group is given · ef2a7c90
      Florent Fourcot committed
      When the interface does not exist, and a group is given, the given
      parameters are being set to all interfaces of the given group. The given
      IFNAME/ALT_IF_NAME are being ignored in that case.
      
      That can be dangerous since a typo (or a deleted interface) can produce
      weird side effects for caller:
      
      Case 1:
      
       IFLA_IFNAME=valid_interface
       IFLA_GROUP=1
       MTU=1234
      
      Case 1 will update MTU and group of the given interface "valid_interface".
      
      Case 2:
      
       IFLA_IFNAME=doesnotexist
       IFLA_GROUP=1
       MTU=1234
      
      Case 2 will update MTU of all interfaces in group 1. IFLA_IFNAME is
      ignored in this case
      
      This behaviour is not consistent and dangerous. In order to fix this issue,
      we now return ENODEV when the given IFNAME does not exist.
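
The resulting behaviour can be sketched as a small decision function. All names below are hypothetical and lookup by interface index is omitted; the sketch only illustrates the rule that a specified name must resolve before any group-wide fallback is considered. The -EINVAL branch reflects the related commit in this series that rejects requests with no name, index, or group.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device record for illustration. */
struct sketch_dev {
	const char *name;
	int group;
};

enum sketch_target {
	SKETCH_ONE_DEV = 0,	/* apply to the named device only */
	SKETCH_WHOLE_GROUP = 1,	/* apply to every device in the group */
};

/* Linear lookup by name; returns NULL when no device matches. */
static const struct sketch_dev *sketch_find(const struct sketch_dev *devs,
					    size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(devs[i].name, name) == 0)
			return &devs[i];
	return NULL;
}

/* Returns an enum sketch_target value, or a negative errno. */
static int sketch_resolve(const struct sketch_dev *devs, size_t n,
			  const char *ifname, bool has_group)
{
	if (ifname) {
		/* A name was given: it must exist, whether or not a group is set. */
		return sketch_find(devs, n, ifname) ? SKETCH_ONE_DEV : -ENODEV;
	}
	if (has_group)
		return SKETCH_WHOLE_GROUP;

	return -EINVAL;
}
```

Under this rule, Case 2 above fails with -ENODEV instead of silently changing every interface in group 1.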
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ef2a7c90
    • P
      Merge branch 'net-sched-allow-user-to-select-txqueue' · 8b11c35d
      Paolo Abeni committed
      Tonghao Zhang says:
      
      ====================
      net: sched: allow user to select txqueue
      
      From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      
      Patch 1 allows the user to select the txqueue in the clsact hook.
      Patch 2 supports using skbhash to select the txqueue.
      ====================
      
      Link: https://lore.kernel.org/r/20220415164046.26636-1-xiangxia.m.yue@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      8b11c35d
    • T
      net: sched: support hash selecting tx queue · 38a6f086
      Tonghao Zhang committed
      This patch allows users to pick queue_mapping from a range,
      A to B, so that packets can be load balanced across tx
      queues A to B. Each bound is an unsigned 16-bit value
      in decimal format.
      
      $ tc filter ... action skbedit queue_mapping skbhash A B
      
      "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
      is enhanced with flags: SKBEDIT_F_TXQ_SKBHASH
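
One way to map a flow hash onto the range A to B is a modulo reduction, sketched below. The helper name is made up and the modulo step is an assumption about how the range could be applied; the commit message only states that packets are load balanced across queues A to B using skb->hash.

```c
#include <stdint.h>

/*
 * Map a 32-bit flow hash onto the inclusive tx queue range [min_q, max_q].
 * Assumes min_q <= max_q. Packets of the same flow share a hash and so
 * always land on the same queue.
 */
static uint16_t sketch_pick_txq(uint32_t hash, uint16_t min_q, uint16_t max_q)
{
	uint32_t span = (uint32_t)max_q - min_q + 1u;

	return (uint16_t)(min_q + hash % span);
}
```

For the range 2 to 6 the span is 5, so every result falls on queues 2, 3, 4, 5, or 6.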
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | qn        | qm
          v           v           v
        HTB/FQ       FIFO   ...  FIFO
      
      For example:
      If P1 sends out packets to different Pods on another host, and
      we want to distribute flows across qn - qm, then we can use skb->hash
      as the hash.
      
      setup commands:
      $ NETDEV=eth0
      $ ip netns add n1
      $ ip link add ipv1 link $NETDEV type ipvlan mode l2
      $ ip link set ipv1 netns n1
      $ ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
      
      $ tc qdisc add dev $NETDEV clsact
      $ tc filter add dev $NETDEV egress protocol ip prio 1 \
              flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping skbhash 2 6
      $ tc qdisc add dev $NETDEV handle 1: root mq
      $ tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
      $ tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
      $ tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
      $ tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
      $ tc qdisc add dev $NETDEV parent 1:3 pfifo
      $ tc qdisc add dev $NETDEV parent 1:4 pfifo
      $ tc qdisc add dev $NETDEV parent 1:5 pfifo
      $ tc qdisc add dev $NETDEV parent 1:6 pfifo
      $ tc qdisc add dev $NETDEV parent 1:7 pfifo
      
      $ ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10
      
      pick txqueue from 2 - 6:
      $ ethtool -S $NETDEV | grep -i tx_queue_[0-9]_bytes
           tx_queue_0_bytes: 42
           tx_queue_1_bytes: 0
           tx_queue_2_bytes: 11442586444
           tx_queue_3_bytes: 7383615334
           tx_queue_4_bytes: 3981365579
           tx_queue_5_bytes: 3983235051
           tx_queue_6_bytes: 6706236461
           tx_queue_7_bytes: 42
           tx_queue_8_bytes: 0
           tx_queue_9_bytes: 0
      
      txqueues 2 - 6 are mapped to classid 1:3 - 1:7
      $ tc -s class show dev $NETDEV
      ...
      class mq 1:3 root leaf 8002:
       Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:4 root leaf 8003:
       Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:5 root leaf 8004:
       Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:6 root leaf 8005:
       Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:7 root leaf 8006:
       Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      ...
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      38a6f086
    • T
      net: sched: use queue_mapping to pick tx queue · 2f1e85b1
      Tonghao Zhang committed
      This patch fixes the following issue:
      * If we install tc filters with act_skbedit in the clsact hook,
        they do not work, because netdev_core_pick_tx() overwrites
        queue_mapping.
      
        $ tc filter ... action skbedit queue_mapping 1
      
      This patch is also useful:
      * We can use FQ + EDT to implement efficient policies. Tx queues
        are picked by xps, ndo_select_queue of the netdev driver, or the skb hash
        in netdev_core_pick_tx(). In fact, the netdev driver and the skb
        hash are _not_ under our control. xps uses the CPU map to select Tx
        queues, but in most cases we can't figure out which task_struct of a
        pod/container is running on a given CPU. We can use clsact filters to
        classify the traffic of one pod/container to one Tx queue. Why?
      
        In a container networking environment, there are two kinds of pod/
        container/net-namespace. For one kind (e.g. P1, P2), high throughput
        is key for the applications. But to avoid running out of network resources,
        the outbound traffic of these pods is limited, and they use or share
        dedicated Tx queues assigned an HTB/TBF/FQ Qdisc. For the other kind of pods
        (e.g. Pn), low latency of data access is key, and the traffic is not
        limited. These pods use or share other dedicated Tx queues assigned a FIFO Qdisc.
        This choice provides two benefits. First, contention on the HTB/FQ Qdisc
        lock is significantly reduced since fewer CPUs contend for the same queue.
        More importantly, Qdisc contention can be eliminated completely if each
        CPU has its own FIFO Qdisc for the second kind of pods.
      
        There must be a mechanism in place to support classifying traffic based on
        pods/container to different Tx queues. Note that clsact is outside of Qdisc
        while Qdisc can run a classifier to select a sub-queue under the lock.
      
        In general, recording the decision in the skb seems a little heavy handed.
        This patch introduces a per-CPU variable, suggested by Eric.
      
        The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
        - A Tx Qdisc may install skbedit actions; the xmit.skip_txqueue flag
          is then set in qdisc->enqueue() even though the tx queue has already been
          selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). Clearing
          the flag first in __dev_queue_xmit() is useful to:
        - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next netdev
          in a case such as: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy.
          For example, eth0 (macvlan in a pod), whose root Qdisc installs skbedit
          queue_mapping, sends packets to eth0.3 (vlan in the host). __dev_queue_xmit() of
          eth0.3 clears the flag and does not select the tx queue according to
          skb->queue_mapping, because there are no filters in clsact or the tx Qdisc
          of this netdev. The same action is taken in eth0 (ixgbe in the host).
        - Avoid picking the Tx queue for the next packet. If we set xmit.skip_txqueue
          in the tx Qdisc (qdisc->enqueue()), the proper way to clear it is to clear it
          in __dev_queue_xmit() when processing the next packets.
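
The flag handling described above can be sketched in user-space C. Everything here is a simplified stand-in: the kernel keeps the flag in per-CPU state, while this sketch uses a plain struct, and the default picker is a placeholder that always returns queue 0.

```c
#include <stdbool.h>

/* Stand-in for the per-CPU xmit state; a plain struct in this sketch. */
struct sketch_xmit_state {
	bool skip_txqueue;
};

/* Stand-in for an skb; only the queue mapping is modelled. */
struct sketch_pkt {
	unsigned int queue_mapping;
};

/* What an skbedit-style action does: choose the queue and set the flag. */
static void sketch_skbedit(struct sketch_xmit_state *st, struct sketch_pkt *p,
			   unsigned int queue)
{
	p->queue_mapping = queue;
	st->skip_txqueue = true;
}

/* Placeholder for the default picker (xps, driver callback, or hash). */
static unsigned int sketch_default_pick(const struct sketch_pkt *p)
{
	(void)p;
	return 0;
}

/*
 * Simplified transmit path. filter_queue < 0 means no filter matched.
 * The flag is cleared first for every packet, so a decision made for
 * one packet or one device never leaks into the next.
 */
static unsigned int sketch_xmit(struct sketch_xmit_state *st,
				struct sketch_pkt *p, int filter_queue)
{
	st->skip_txqueue = false;

	if (filter_queue >= 0)
		sketch_skbedit(st, p, (unsigned int)filter_queue);

	/* Only fall back to the default picker when no filter chose a queue. */
	if (!st->skip_txqueue)
		p->queue_mapping = sketch_default_pick(p);

	return p->queue_mapping;
}
```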
      
        For performance reasons, a static key is used. If the user does not configure
        NET_EGRESS, this code will not be compiled.
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | q1        | qn
          v           v           v
        HTB/FQ      HTB/FQ  ...  FIFO
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      2f1e85b1
  3. 18 Apr, 2022 16 commits