1. 31 8月, 2021 1 次提交
  2. 30 8月, 2021 2 次提交
  3. 28 8月, 2021 1 次提交
    • R
      ipv6: add IFLA_INET6_RA_MTU to expose mtu value · 49b99da2
      Rocco Yue 提交于
      The kernel provides a "/proc/sys/net/ipv6/conf/<iface>/mtu"
      file, which can temporarily record the mtu value of the last
      received RA message when the RA mtu value is lower than the
      interface mtu, but this proc has following limitations:
      
      (1) when the interface mtu (/sys/class/net/<iface>/mtu) is
      updeated, mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) will
      be updated to the value of interface mtu;
      (2) mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) only affect
      ipv6 connection, and not affect ipv4.
      
      Therefore, when the mtu option is carried in the RA message,
      there will be a problem that the user sometimes cannot obtain
      RA mtu value correctly by reading mtu6.
      
      After this patch set, if a RA message carries the mtu option,
      you can send a netlink msg which nlmsg_type is RTM_GETLINK,
      and then by parsing the attribute of IFLA_INET6_RA_MTU to
      get the mtu value carried in the RA message received on the
      inet6 device. In addition, you can also get a link notification
      when ra_mtu is updated so it doesn't have to poll.
      
      In this way, if the MTU values that the device receives from
      the network in the PCO IPv4 and the RA IPv6 procedures are
      different, the user can obtain the correct ipv6 ra_mtu value
      and compare the value of ra_mtu and ipv4 mtu, then the device
      can use the lower MTU value for both IPv4 and IPv6.
      Signed-off-by: NRocco Yue <rocco.yue@mediatek.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210827150412.9267-1-rocco.yue@mediatek.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      49b99da2
  4. 26 8月, 2021 1 次提交
  5. 25 8月, 2021 6 次提交
  6. 24 8月, 2021 3 次提交
    • Z
      ipv6: correct comments about fib6_node sernum · 446e7f21
      zhang kai 提交于
      correct comments in set and get fn_sernum
      Signed-off-by: Nzhang kai <zhangkaiheb@126.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      446e7f21
    • V
      net: dsa: let drivers state that they need VLAN filtering while standalone · 58adf9dc
      Vladimir Oltean 提交于
      As explained in commit e358bef7 ("net: dsa: Give drivers the chance
      to veto certain upper devices"), the hellcreek driver uses some tricks
      to comply with the network stack expectations: it enforces port
      separation in standalone mode using VLANs. For untagged traffic,
      bridging between ports is prevented by using different PVIDs, and for
      VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on
      two ports, so packets with one VLAN cannot leak from one port to another.
      
      That is almost fine*, and has worked because hellcreek relied on an
      implicit behavior of the DSA core that was changed by the previous
      patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on
      [fixed]'. Since most of the DSA drivers are actually VLAN-unaware in
      standalone mode, that feature was actually incorrectly reflecting the
      hardware/driver state, so there was a desire to fix it. This leaves the
      hellcreek driver in a situation where it has to explicitly request this
      behavior from the DSA framework.
      
      We configure the ports as follows:
      
      - Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a
        standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid
        and will add a VLAN to the hardware tables, giving the driver the
        opportunity to refuse it through .port_prechangeupper.
      
      - Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper
        on top of a bridged hellcreek port will not go through
        dsa_slave_vlan_rx_add_vid, because there will not be any attempt to
        offload this VLAN. The driver already disables VLAN awareness, so that
        upper should receive the traffic it needs.
      
      - Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper
        on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid,
        and can again be vetoed through .port_prechangeupper.
      
      *It is not actually completely fine, because if I follow through
      correctly, we can have the following situation:
      
      ip link add br0 type bridge vlan_filtering 0
      ip link set lan0 master br0 # lan0 now becomes VLAN-unaware
      ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation
      
      This patch fixes that corner case by extending the DSA core logic, based
      on this requested attribute, to change the VLAN awareness state of the
      switch (port) when it leaves the bridge.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Acked-by: NKurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58adf9dc
    • L
      mac80211: introduce individual TWT support in AP mode · f5a4c24e
      Lorenzo Bianconi 提交于
      Introduce TWT action frames parsing support to mac80211.
      Currently just individual TWT agreement are support in AP mode.
      Whenever the AP receives a TWT action frame from an associated client,
      after performing sanity checks, it will notify the underlay driver with
      requested parameters in order to check if they are supported and if there
      is enough room for a new agreement. The driver is expected to set the
      agreement result and report it to mac80211.
      
      Drivers supporting this have two new callbacks:
       - add_twt_setup (mandatory)
       - twt_teardown_request (optional)
      
      mac80211 will send an action frame reply according to the result
      reported by the driver.
      Tested-by: NPeter Chiu <chui-hao.chiu@mediatek.com>
      Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org>
      Link: https://lore.kernel.org/r/257512f2e22ba42b9f2624942a128dd8f141de4b.1629741512.git.lorenzo@kernel.org
      [use le16p_replace_bits(), minor cleanups, use (void *) casts,
       fix to use ieee80211_get_he_iftype_cap() correctly]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      f5a4c24e
  7. 23 8月, 2021 1 次提交
    • V
      net: dsa: track unique bridge numbers across all DSA switch trees · f5e165e7
      Vladimir Oltean 提交于
      Right now, cross-tree bridging setups work somewhat by mistake.
      
      In the case of cross-tree bridging with sja1105, all switch instances
      need to agree upon a common VLAN ID for forwarding a packet that belongs
      to a certain bridging domain.
      
      With TX forwarding offload, the VLAN ID is the bridge VLAN for
      VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
      (a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
      
      The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
      value calculated by DSA independently for each switch tree.
      
      If ports from one tree join one bridge, and ports from another tree join
      another bridge, DSA will assign them the same bridge_num, even though
      the bridges are different. If cross-tree bridging is supported, this
      is an issue.
      
      Modify DSA to calculate the bridge_num globally across all switch trees.
      This has the implication for a driver that the dp->bridge_num value that
      DSA will assign to its ports might not be contiguous, if there are
      boards with multiple DSA drivers instantiated. Additionally, all
      bridge_num values eat up towards each switch's
      ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
      and can be seen as a limitation introduced by this patch. However, that
      is the lesser evil for now.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5e165e7
  8. 20 8月, 2021 1 次提交
  9. 19 8月, 2021 1 次提交
  10. 18 8月, 2021 1 次提交
  11. 17 8月, 2021 2 次提交
  12. 16 8月, 2021 1 次提交
  13. 14 8月, 2021 3 次提交
  14. 13 8月, 2021 1 次提交
    • K
      mac80211: Use flex-array for radiotap header bitmap · 8c89f7b3
      Kees Cook 提交于
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      The it_present member of struct ieee80211_radiotap_header is treated as a
      flexible array (multiple u32s can be conditionally present). In order for
      memcpy() to reason (or really, not reason) about the size of operations
      against this struct, use of bytes beyond it_present need to be treated
      as part of the flexible array. Add a trailing flexible array and
      initialize its initial index via pointer arithmetic.
      
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: linux-wireless@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210806215305.2875621-1-keescook@chromium.orgSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      8c89f7b3
  15. 12 8月, 2021 2 次提交
  16. 11 8月, 2021 5 次提交
  17. 10 8月, 2021 6 次提交
    • F
      netfilter: nf_queue: move hookfn registration out of struct net · 87029970
      Florian Westphal 提交于
      This was done to detect when the pernet->init() function was not called
      yet, by checking if net->nf.queue_handler is NULL.
      
      Once the nfnetlink_queue module is active, all struct net pointers
      contain the same address.  So place this back in nf_queue.c.
      
      Handle the 'netns error unwind' test by checking nfnl_queue_net for a
      NULL pointer and add a comment for this.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      87029970
    • Y
      page_pool: add frag page recycling support in page pool · 53e0961d
      Yunsheng Lin 提交于
      Currently page pool only support page recycling when there
      is only one user of the page, and the split page reusing
      implemented in the most driver can not use the page pool as
      bing-pong way of reusing requires the multi user support in
      page pool.
      
      Those reusing or recycling has below limitations:
      1. page from page pool can only be used be one user in order
         for the page recycling to happen.
      2. Bing-pong way of reusing in most driver does not support
         multi desc using different part of the same page in order
         to save memory.
      
      So add multi-users support and frag page recycling in page
      pool to overcome the above limitation.
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      53e0961d
    • Y
      page_pool: add interface to manipulate frag count in page pool · 0e9d2a0a
      Yunsheng Lin 提交于
      For 32 bit systems with 64 bit dma, dma_addr[1] is used to
      store the upper 32 bit dma addr, those system should be rare
      those days.
      
      For normal system, the dma_addr[1] in 'struct page' is not
      used, so we can reuse dma_addr[1] for storing frag count,
      which means how many frags this page might be splited to.
      
      In order to simplify the page frag support in the page pool,
      the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
      the 32 bit systems with 64 bit dma, and the page frag support
      in page pool is disabled for such system.
      
      The newly added page_pool_set_frag_count() is called to reserve
      the maximum frag count before any page frag is passed to the
      user. The page_pool_atomic_sub_frag_count_return() is called
      when user is done with the page frag.
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      0e9d2a0a
    • Y
      page_pool: keep pp info as long as page pool owns the page · 57f05bc2
      Yunsheng Lin 提交于
      Currently, page->pp is cleared and set everytime the page
      is recycled, which is unnecessary.
      
      So only set the page->pp when the page is added to the page
      pool and only clear it when the page is released from the
      page pool.
      
      This is also a preparation to support allocating frag page
      in page pool.
      Reviewed-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      57f05bc2
    • R
      psample: Add a fwd declaration for skbuff · beb7f2de
      Roi Dayan 提交于
      Without this there is a warning if source files include psample.h
      before skbuff.h or doesn't include it at all.
      
      Fixes: 6ae0a628 ("net: Introduce psample, a new genetlink channel for packet sampling")
      Signed-off-by: NRoi Dayan <roid@nvidia.com>
      Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      beb7f2de
    • J
      net, bonding: Add XDP support to the bonding driver · 9e2ee5c7
      Jussi Maki 提交于
      XDP is implemented in the bonding driver by transparently delegating
      the XDP program loading, removal and xmit operations to the bonding
      slave devices. The overall goal of this work is that XDP programs
      can be attached to a bond device *without* any further changes (or
      awareness) necessary to the program itself, meaning the same XDP
      program can be attached to a native device but also a bonding device.
      
      Semantics of XDP_TX when attached to a bond are equivalent in such
      setting to the case when a tc/BPF program would be attached to the
      bond, meaning transmitting the packet out of the bond itself using one
      of the bond's configured xmit methods to select a slave device (rather
      than XDP_TX on the slave itself). Handling of XDP_TX to transmit
      using the configured bonding mechanism is therefore implemented by
      rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
      performance impact this check is guarded by a static key, which is
      incremented when a XDP program is loaded onto a bond device. This
      approach was chosen to avoid changes to drivers implementing XDP. If
      the slave device does not match the receive device, then XDP_REDIRECT
      is transparently used to perform the redirection in order to have
      the network driver release the packet from its RX ring. The bonding
      driver hashing functions have been refactored to allow reuse with
      xdp_buff's to avoid code duplication.
      
      The motivation for this change is to enable use of bonding (and
      802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
      XDP and also to transparently support bond devices for projects that
      use XDP given most modern NICs have dual port adapters. An alternative
      to this approach would be to implement 802.3ad in user-space and
      implement the bonding load-balancing in the XDP program itself, but
      is rather a cumbersome endeavor in terms of slave device management
      (e.g. by watching netlink) and requires separate programs for native
      vs bond cases for the orchestrator. A native in-kernel implementation
      overcomes these issues and provides more flexibility.
      
      Below are benchmark results done on two machines with 100Gbit
      Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
      16-core 3950X on receiving machine. 64 byte packets were sent with
      pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
      ice driver, so the tests were performed with iommu=off and patch [2]
      applied. Additionally the bonding round robin algorithm was modified
      to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
      of cache misses were caused by the shared rr_tx_counter (see patch
      2/3). The statistics were collected using "sar -n dev -u 1 10". On top
      of that, for ice, further work is in progress on improving the XDP_TX
      numbers [4].
      
       -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
       without patch (1 dev):
         XDP_DROP:              3.15%      48.6Mpps
         XDP_TX:                3.12%      18.3Mpps     18.3Mpps
         XDP_DROP (RSS):        9.47%      116.5Mpps
         XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
       -----------------------
       with patch, bond (1 dev):
         XDP_DROP:              3.14%      46.7Mpps
         XDP_TX:                3.15%      13.9Mpps     13.9Mpps
         XDP_DROP (RSS):        10.33%     117.2Mpps
         XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
       -----------------------
       with patch, bond (2 devs):
         XDP_DROP:              6.27%      92.7Mpps
         XDP_TX:                6.26%      17.6Mpps     17.5Mpps
         XDP_DROP (RSS):       11.38%      117.2Mpps
         XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
       --------------------------------------------------------------
      
      RSS: Receive Side Scaling, e.g. the packets were sent to a range of
      destination IPs.
      
        [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
        [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
        [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
        [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#tSigned-off-by: NJussi Maki <joamaki@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
      9e2ee5c7
  18. 09 8月, 2021 2 次提交
    • L
      devlink: Set device as early as possible · 919d13a7
      Leon Romanovsky 提交于
      All kernel devlink implementations call to devlink_alloc() during
      initialization routine for specific device which is used later as
      a parent device for devlink_register().
      
      Such late device assignment causes to the situation which requires us to
      call to device_register() before setting other parameters, but that call
      opens devlink to the world and makes accessible for the netlink users.
      
      Any attempt to move devlink_register() to be the last call generates the
      following error due to access to the devlink->dev pointer.
      
      [    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
      [    8.760305]  devlink_param_notify+0x6d/0x180
      [    8.760435]  __devlink_params_register+0x2f1/0x670
      [    8.760558]  devlink_params_register+0x1e/0x20
      
      The simple change of API to set devlink device in the devlink_alloc()
      instead of devlink_register() fixes all this above and ensures that
      prior to call to devlink_register() everything already set.
      Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      919d13a7
    • V
      net: dsa: sja1105: rely on DSA core tracking of port learning state · 5313a37b
      Vladimir Oltean 提交于
      Now that DSA keeps track of the port learning state, it becomes
      superfluous to keep an additional variable with this information in the
      sja1105 driver. Remove it.
      
      The DSA core's learning state is present in struct dsa_port *dp.
      To avoid the antipattern where we iterate through a DSA switch's
      ports and then call dsa_to_port to obtain the "dp" reference (which is
      bad because dsa_to_port iterates through the DSA switch tree once
      again), just iterate through the dst->ports and operate on those
      directly.
      
      The sja1105 had an extra use of priv->learn_ena on non-user ports. DSA
      does not touch the learning state of those ports - drivers are free to
      do what they wish on them. Mark that information with a comment in
      struct dsa_port and let sja1105 set dp->learning for cascade ports.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5313a37b