1. 10 Aug 2021, 5 commits
    • page_pool: add frag page recycling support in page pool · 53e0961d
      Yunsheng Lin committed
      Currently the page pool only supports page recycling when there
      is only one user of the page, and the split-page reuse
      implemented in most drivers cannot use the page pool, as the
      ping-pong way of reusing requires multi-user support in the
      page pool.

      This reuse or recycling has the following limitations:
      1. a page from the page pool can only be used by one user in order
         for the page recycling to happen.
      2. the ping-pong way of reusing in most drivers does not support
         multiple descriptors using different parts of the same page in
         order to save memory.

      So add multi-user support and frag page recycling in the page
      pool to overcome the above limitations.
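      The sketch below shows how a driver RX-refill path might use the new
      frag allocation. It assumes a pool created with the new PP_FLAG_PAGE_FRAG
      flag and a page_pool_dev_alloc_frag() helper as introduced by this
      series, so treat the exact names and signatures as assumptions rather
      than settled API.

      /* Hedged sketch of an RX refill path built on frag recycling. */
      #include <net/page_pool.h>

      struct rx_buf {
              struct page *page;
              unsigned int offset;
      };

      static int rx_refill_one(struct page_pool *pool, struct rx_buf *buf,
                               unsigned int frag_size)
      {
              unsigned int offset;
              struct page *page;

              /* Hands out part of a page; the pool only recycles the page
               * once every frag user has dropped its reference.
               */
              page = page_pool_dev_alloc_frag(pool, &offset, frag_size);
              if (!page)
                      return -ENOMEM;

              buf->page = page;
              buf->offset = offset;
              return 0;
      }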
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      53e0961d
    • page_pool: add interface to manipulate frag count in page pool · 0e9d2a0a
      Yunsheng Lin committed
      For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
      store the upper 32 bits of the DMA address; such systems should
      be rare these days.

      For a normal system, dma_addr[1] in 'struct page' is not used,
      so we can reuse dma_addr[1] for storing the frag count, which
      means how many frags this page might be split into.

      In order to simplify the page frag support in the page pool,
      the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
      32-bit systems with 64-bit DMA, and the page frag support in
      the page pool is disabled for such systems.

      The newly added page_pool_set_frag_count() is called to reserve
      the maximum frag count before any page frag is passed to the
      user. page_pool_atomic_sub_frag_count_return() is called when
      the user is done with the page frag.
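      A rough sketch of how the two helpers named above could fit together;
      the recycling call shown and the zero-count convention are assumptions
      made for illustration, inferred from the description rather than copied
      from the patch.

      /* Hedged sketch: reserve the maximum frag count up front, then drop
       * one reference per finished user; the last user lets the page be
       * recycled. Details are assumptions.
       */
      static void pool_prepare_frag_page(struct page *page, long max_frags)
      {
              /* Only valid when dma_addr[1] is free, i.e. not a 32-bit
               * system with 64-bit DMA (PAGE_POOL_DMA_USE_PP_FRAG_COUNT).
               */
              page_pool_set_frag_count(page, max_frags);
      }

      static void pool_put_frag(struct page_pool *pool, struct page *page)
      {
              /* Returns the remaining count; the last user triggers recycling. */
              if (page_pool_atomic_sub_frag_count_return(page, 1) == 0)
                      page_pool_put_full_page(pool, page, false);
      }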
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0e9d2a0a
    • page_pool: keep pp info as long as page pool owns the page · 57f05bc2
      Yunsheng Lin committed
      Currently, page->pp is cleared and set every time the page
      is recycled, which is unnecessary.

      So only set page->pp when the page is added to the page
      pool, and only clear it when the page is released from the
      page pool.

      This is also a preparation for supporting frag page allocation
      in the page pool.
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      57f05bc2
    • net, bonding: Add XDP support to the bonding driver · 9e2ee5c7
      Jussi Maki committed
      XDP is implemented in the bonding driver by transparently delegating
      the XDP program loading, removal and xmit operations to the bonding
      slave devices. The overall goal of this work is that XDP programs
      can be attached to a bond device *without* any further changes (or
      awareness) necessary to the program itself, meaning the same XDP
      program can be attached to a native device but also a bonding device.
      
      The semantics of XDP_TX when attached to a bond are, in such a
      setting, equivalent to the case where a tc/BPF program is attached
      to the bond: the packet is transmitted out of the bond itself using
      one of the bond's configured xmit methods to select a slave device
      (rather than XDP_TX on the slave itself). Handling of XDP_TX to
      transmit using the configured bonding mechanism is therefore
      implemented by rewriting the BPF program return value in
      bpf_prog_run_xdp. To avoid performance impact this check is guarded
      by a static key, which is incremented when an XDP program is loaded
      onto a bond device. This approach was chosen to avoid changes to
      drivers implementing XDP. If the slave device does not match the
      receive device, then XDP_REDIRECT is transparently used to perform
      the redirection in order to have the network driver release the
      packet from its RX ring. The bonding driver hashing functions have
      been refactored to allow reuse with xdp_buffs and avoid code
      duplication.
      
      The motivation for this change is to enable use of bonding (and
      802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
      XDP and also to transparently support bond devices for projects that
      use XDP given most modern NICs have dual port adapters. An alternative
      to this approach would be to implement 802.3ad in user-space and
      implement the bonding load-balancing in the XDP program itself, but
      that is rather a cumbersome endeavor in terms of slave device management
      (e.g. by watching netlink) and requires separate programs for native
      vs bond cases for the orchestrator. A native in-kernel implementation
      overcomes these issues and provides more flexibility.
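      The return-value rewrite described above might look roughly like the
      sketch below; the static key and the redirect helper are given
      hypothetical names here, and the patch's actual symbols may differ.

      #include <linux/filter.h>
      #include <linux/netdevice.h>

      DECLARE_STATIC_KEY_FALSE(bond_xdp_needed_key);        /* hypothetical */
      u32 bond_xdp_transmit_rewrite(struct xdp_buff *xdp);   /* hypothetical */

      /* Hedged sketch of the rewrite: the extra check only costs anything
       * while an XDP program is attached to a bond.
       */
      static __always_inline u32 run_xdp_with_bond_rewrite(const struct bpf_prog *prog,
                                                           struct xdp_buff *xdp)
      {
              u32 act = bpf_prog_run_xdp(prog, xdp);

              if (static_branch_unlikely(&bond_xdp_needed_key) &&
                  act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev))
                      /* Picks the egress slave via the bond's xmit policy and
                       * converts the action into a redirect towards it.
                       */
                      act = bond_xdp_transmit_rewrite(xdp);

              return act;
      }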
      
      Below are benchmark results from two machines with 100Gbit
      Intel E810 (ice) NICs, with a 32-core 3970X on the sending machine
      and a 16-core 3950X on the receiving machine. 64-byte packets were
      sent with pktgen-dpdk at full rate. Two issues [2, 3] were identified
      with the ice driver, so the tests were performed with iommu=off and
      patch [2] applied. Additionally, the bonding round-robin algorithm
      was modified to use per-CPU tx counters, as high CPU load (50% vs
      10%) and a high rate of cache misses were caused by the shared
      rr_tx_counter (see patch 2/3). The statistics were collected using
      "sar -n dev -u 1 10". On top of that, for ice, further work is in
      progress on improving the XDP_TX numbers [4].
      
       -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
       without patch (1 dev):
         XDP_DROP:              3.15%      48.6Mpps
         XDP_TX:                3.12%      18.3Mpps     18.3Mpps
         XDP_DROP (RSS):        9.47%      116.5Mpps
         XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
       -----------------------
       with patch, bond (1 dev):
         XDP_DROP:              3.14%      46.7Mpps
         XDP_TX:                3.15%      13.9Mpps     13.9Mpps
         XDP_DROP (RSS):        10.33%     117.2Mpps
         XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
       -----------------------
       with patch, bond (2 devs):
         XDP_DROP:              6.27%      92.7Mpps
         XDP_TX:                6.26%      17.6Mpps     17.5Mpps
         XDP_DROP (RSS):       11.38%      117.2Mpps
         XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
       --------------------------------------------------------------
      
      RSS: Receive Side Scaling, i.e. the packets were sent to a range of
      destination IPs.
      
        [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
        [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
        [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
        [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
      9e2ee5c7
    • net, core: Add support for XDP redirection to slave device · 879af96f
      Jussi Maki committed
      This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
      into XDP_REDIRECT after the BPF program has run, when the ingress
      device is a bond slave.

      dev_xdp_prog_count is exposed so that slave devices can be checked
      for loaded XDP programs, in order to avoid the situation where both
      the bond master and a slave have programs loaded according to xdp_state.
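      A small sketch of how the new hook could be consulted on the redirect
      path; the ndo's exact prototype is assumed from the description.

      #include <linux/filter.h>
      #include <linux/netdevice.h>

      /* Assumed prototype of the new ndo:
       *   struct net_device *(*ndo_xdp_get_xmit_slave)(struct net_device *dev,
       *                                                struct xdp_buff *xdp);
       */
      static u32 xdp_tx_via_bond(struct net_device *bond_dev, struct xdp_buff *xdp)
      {
              const struct net_device_ops *ops = bond_dev->netdev_ops;
              struct net_device *slave;

              if (!ops->ndo_xdp_get_xmit_slave)
                      return XDP_DROP;

              slave = ops->ndo_xdp_get_xmit_slave(bond_dev, xdp);
              if (!slave)
                      return XDP_DROP;

              /* Redirect when the chosen slave is not the receive device, so
               * the RX driver releases the frame from its ring.
               */
              return slave == xdp->rxq->dev ? XDP_TX : XDP_REDIRECT;
      }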
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
      879af96f
  2. 09 Aug 2021, 3 commits
    • devlink: Set device as early as possible · 919d13a7
      Leon Romanovsky committed
      All kernel devlink implementations call devlink_alloc() during the
      initialization routine of the specific device, which is later used
      as the parent device in devlink_register().

      Such late device assignment leads to a situation that requires us to
      call device_register() before setting other parameters, but that call
      opens devlink to the world and makes it accessible to netlink users.
      
      Any attempt to move devlink_register() to be the last call generates the
      following error due to access to the devlink->dev pointer.
      
      [    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
      [    8.760305]  devlink_param_notify+0x6d/0x180
      [    8.760435]  __devlink_params_register+0x2f1/0x670
      [    8.760558]  devlink_params_register+0x1e/0x20
      
      The simple API change of setting the devlink device in devlink_alloc()
      instead of devlink_register() fixes all of the above and ensures that
      everything is already set prior to the call to devlink_register().
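      A hedged before/after sketch of the driver-side change this implies;
      my_devlink_ops, my_priv and the probe function are placeholders, and
      the exact signatures may differ from the final API.

      #include <linux/platform_device.h>
      #include <net/devlink.h>

      static const struct devlink_ops my_devlink_ops;   /* placeholder */
      struct my_priv { int placeholder; };               /* placeholder */

      /* Before: the parent device was only known at registration time.
       *
       *      devlink = devlink_alloc(&my_devlink_ops, sizeof(struct my_priv));
       *      ...
       *      devlink_register(devlink, &pdev->dev);
       *
       * After: the device is attached at allocation, so everything can be
       * configured before devlink_register() opens the instance to netlink.
       */
      static int my_probe(struct platform_device *pdev)
      {
              struct devlink *devlink;

              devlink = devlink_alloc(&my_devlink_ops, sizeof(struct my_priv),
                                      &pdev->dev);
              if (!devlink)
                      return -ENOMEM;

              /* set params, ports, etc. while still invisible to users */

              devlink_register(devlink);
              return 0;
      }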
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      919d13a7
    • net: dsa: sja1105: rely on DSA core tracking of port learning state · 5313a37b
      Vladimir Oltean committed
      Now that DSA keeps track of the port learning state, it becomes
      superfluous to keep an additional variable with this information in the
      sja1105 driver. Remove it.
      
      The DSA core's learning state is present in struct dsa_port *dp.
      To avoid the antipattern where we iterate through a DSA switch's
      ports and then call dsa_to_port to obtain the "dp" reference (which is
      bad because dsa_to_port iterates through the DSA switch tree once
      again), just iterate through the dst->ports and operate on those
      directly.
      
      The sja1105 had an extra use of priv->learn_ena on non-user ports. DSA
      does not touch the learning state of those ports - drivers are free to
      do what they wish on them. Mark that information with a comment in
      struct dsa_port and let sja1105 set dp->learning for cascade ports.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5313a37b
    • net: dsa: centralize fast ageing when address learning is turned off · 045c45d1
      Vladimir Oltean committed
      Currently DSA leaves it up to device drivers to fast age the FDB on a
      port when address learning is disabled on it. There are 2 reasons for
      doing that in the first place:
      
      - when address learning is disabled by user space, through
        IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
        space typically wants to achieve is to operate in a mode with no
        dynamic FDB entry on that port. But if the port is already up, some
        addresses might have been already learned on it, and it seems silly to
        wait for 5 minutes for them to expire until something useful can be
        done.
      
      - when a port leaves a bridge and becomes standalone, DSA turns off
        address learning on it. This also has the nice side effect of flushing
        the dynamically learned bridge FDB entries on it, which is a good idea
        because standalone ports should not have bridge FDB entries on them.
      
      We let drivers manage fast ageing under this condition because if DSA
      were to do it, it would need to track each port's learning state, and
      act upon the transition, which it currently doesn't.
      
      But there are 2 reasons why doing it is better after all:
      
      - drivers might get it wrong and not do it (see b53_port_set_learning)
      
      - we would like to flush the dynamic entries from the software bridge
        too, and letting drivers do that would be another pain point
      
      So track the port learning state and trigger a fast age process
      automatically within DSA.
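      A hedged sketch of the centralized behavior: DSA remembers the per-port
      learning state and fast-ages on the on-to-off transition. The helper
      name dsa_port_fast_age() is assumed for illustration.

      #include <net/dsa.h>

      static int dsa_port_set_learning(struct dsa_port *dp, bool learning)
      {
              if (dp->learning == learning)
                      return 0;

              /* learning on -> off: flush dynamically learned entries now
               * instead of waiting for them to expire on their own.
               */
              if (dp->learning && !learning)
                      dsa_port_fast_age(dp);    /* assumed helper */

              dp->learning = learning;
              return 0;
      }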
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      045c45d1
  3. 08 Aug 2021, 1 commit
  4. 06 Aug 2021, 2 commits
    • net: dsa: don't disable multicast flooding to the CPU even without an IGMP querier · c73c5708
      Vladimir Oltean committed
      Commit 08cc83cc ("net: dsa: add support for BRIDGE_MROUTER
      attribute") added an option for users to turn off multicast flooding
      towards the CPU if they turn off the IGMP querier on a bridge which
      already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router).
      
      And commit a8b659e7 ("net: dsa: act as passthrough for bridge port flags")
      simply papered over that issue, because it moved the decision to flood
      the CPU with multicast (or not) from the DSA core down to individual drivers,
      instead of taking a more radical position then.
      
      The truth is that disabling multicast flooding to the CPU is simply
      something we are not prepared to do now, if at all. Some reasons:
      
      - ICMP6 neighbor solicitation messages are unregistered multicast
        packets as far as the bridge is concerned. So if we stop flooding
        multicast, the outside world cannot ping the bridge device's IPv6
        link-local address.
      
      - There might be foreign interfaces bridged with our DSA switch ports
        (sending a packet towards the host does not necessarily equal
        termination, but maybe software forwarding). So if there is no one
        interested in that multicast traffic in the local network stack, that
        doesn't mean nobody is.
      
      - PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as
        the bridge is concerned. This should reach the CPU port.
      
      - The switch driver might not do FDB partitioning. And since we don't
        even bother to do more fine-grained flood disabling (such as "disable
        flooding _from_port_N_ towards the CPU port" as opposed to "disable
        flooding _from_any_port_ towards the CPU port"), this breaks standalone
        ports, or even multiple bridges where one has an IGMP querier and one
        doesn't.
      
      Reverting the logic makes all of the above work.
      
      Fixes: a8b659e7 ("net: dsa: act as passthrough for bridge port flags")
      Fixes: 08cc83cc ("net: dsa: add support for BRIDGE_MROUTER attribute")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c73c5708
    • Bluetooth: defer cleanup of resources in hci_unregister_dev() · e0448092
      Tetsuo Handa committed
      syzbot is hitting a might_sleep() warning at hci_sock_dev_event() due
      to calling lock_sock() with an rw spinlock held [1].

      It seems that the history of this locking problem is one of trial and
      error.
      
      Commit b40df574 ("[PATCH] bluetooth: fix socket locking in
      hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to
      lock_sock() as an attempt to fix lockdep warning.
      
      Then, commit 4ce61d1c ("[BLUETOOTH]: Fix locking in
      hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to
      local_bh_disable() + bh_lock_sock_nested() as an attempt to fix the
      sleep in atomic context warning.
      
      Then, commit 4b5dd696 ("Bluetooth: Remove local_bh_disable() from
      hci_sock.c") in 3.3-rc1 removed local_bh_disable().
      
      Then, commit e305509e ("Bluetooth: use correct lock to prevent UAF
      of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to
      lock_sock() as an attempt to fix CVE-2021-3573.
      
      This difficulty comes from the current implementation, in which
      hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all
      references from sockets, because hci_unregister_dev() immediately
      reclaims resources as soon as it returns from
      hci_sock_dev_event(HCI_DEV_UNREG).
      
      But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not
      doing what it should do.
      
      Therefore, instead of trying to detach sockets from device, let's accept
      not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG),
      by moving actual cleanup of resources from hci_unregister_dev() to
      hci_cleanup_dev() which is called by bt_host_release() when all
      references to this unregistered device (which is a kobject) are gone.
      
      Since hci_sock_dev_event(HCI_DEV_UNREG) no longer resets
      hci_pi(sk)->hdev, we need to check whether this device was unregistered
      and return an error based on the HCI_UNREGISTER flag. There might be
      a subtle behavioral difference in the "monitor the hdev" functionality;
      please report it if you find that something went wrong due to this patch.
      
      Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1]
      Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Fixes: e305509e ("Bluetooth: use correct lock to prevent UAF of hdev object")
      Acked-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0448092
  5. 05 Aug 2021, 4 commits
    • netdevice: add the case if dev is NULL · b37a4668
      Yajun Deng committed
      Handle the case where dev is NULL in dev_{put, hold}, so the caller
      doesn't need to care whether dev is NULL or not.
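      The change boils down to a NULL check in the two helpers; a simplified
      sketch of the shape (the real refcounting details in linux/netdevice.h
      are omitted):

      /* Hedged sketch of dev_{hold,put}() accepting dev == NULL so callers
       * no longer need their own checks.
       */
      static inline void dev_hold(struct net_device *dev)
      {
              if (dev)
                      this_cpu_inc(*dev->pcpu_refcnt);
      }

      static inline void dev_put(struct net_device *dev)
      {
              if (dev)
                      this_cpu_dec(*dev->pcpu_refcnt);
      }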
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b37a4668
    • net/ipv6/mcast: Use struct_size() helper · e11c0e25
      Gustavo A. R. Silva committed
      Replace IP6_SFLSIZE() with struct_size() helper in order to avoid any
      potential type mistakes or integer overflows that, in the worst
      scenario, could lead to heap overflows.
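      A generic illustration of the struct_size() pattern being adopted; the
      structure below is a simplified stand-in, not the actual
      ip6_sf_socklist layout.

      #include <linux/in6.h>
      #include <linux/overflow.h>
      #include <linux/slab.h>

      struct sf_socklist {
              unsigned int sl_max;
              struct in6_addr sl_addr[];      /* flexible array member */
      };

      static struct sf_socklist *sf_socklist_alloc(unsigned int count)
      {
              struct sf_socklist *psl;

              /* struct_size() computes sizeof(*psl) + count * sizeof(element)
               * with overflow checking, unlike an open-coded size macro.
               */
              psl = kzalloc(struct_size(psl, sl_addr, count), GFP_KERNEL);
              if (psl)
                      psl->sl_max = count;
              return psl;
      }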
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e11c0e25
    • net/ipv4/igmp: Use struct_size() helper · e6a1f7e0
      Gustavo A. R. Silva committed
      Replace IP_SFLSIZE() with struct_size() helper in order to avoid any
      potential type mistakes or integer overflows that, in the worst
      scenario, could lead to heap overflows.
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6a1f7e0
    • net/ipv4/ipv6: Replace one-element arrays with flexible-array members · db243b79
      Gustavo A. R. Silva committed
      There is a regular need in the kernel to provide a way to declare having
      a dynamically sized set of trailing elements in a structure. Kernel code
      should always use “flexible array members”[1] for these cases. The older
      style of one-element or zero-length arrays should no longer be used[2].
      
      Use an anonymous union with a couple of anonymous structs in order to
      keep userspace unchanged and refactor the related code accordingly:
      
      $ pahole -C group_filter net/ipv4/ip_sockglue.o
      struct group_filter {
      	union {
      		struct {
      			__u32      gf_interface_aux;     /*     0     4 */
      
      			/* XXX 4 bytes hole, try to pack */
      
      			struct __kernel_sockaddr_storage gf_group_aux; /*     8   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      			__u32      gf_fmode_aux;         /*   136     4 */
      			__u32      gf_numsrc_aux;        /*   140     4 */
      			struct __kernel_sockaddr_storage gf_slist[1]; /*   144   128 */
      		};                                       /*     0   272 */
      		struct {
      			__u32      gf_interface;         /*     0     4 */
      
      			/* XXX 4 bytes hole, try to pack */
      
      			struct __kernel_sockaddr_storage gf_group; /*     8   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      			__u32      gf_fmode;             /*   136     4 */
      			__u32      gf_numsrc;            /*   140     4 */
      			struct __kernel_sockaddr_storage gf_slist_flex[0]; /*   144     0 */
      		};                                       /*     0   144 */
      	};                                               /*     0   272 */
      
      	/* size: 272, cachelines: 5, members: 1 */
      	/* last cacheline: 16 bytes */
      };
      
      $ pahole -C compat_group_filter net/ipv4/ip_sockglue.o
      struct compat_group_filter {
      	union {
      		struct {
      			__u32      gf_interface_aux;     /*     0     4 */
      			struct __kernel_sockaddr_storage gf_group_aux __attribute__((__aligned__(4))); /*     4   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
      			__u32      gf_fmode_aux;         /*   132     4 */
      			__u32      gf_numsrc_aux;        /*   136     4 */
      			struct __kernel_sockaddr_storage gf_slist[1] __attribute__((__aligned__(4))); /*   140   128 */
      		} __attribute__((__packed__)) __attribute__((__aligned__(4)));                     /*     0   268 */
      		struct {
      			__u32      gf_interface;         /*     0     4 */
      			struct __kernel_sockaddr_storage gf_group __attribute__((__aligned__(4))); /*     4   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
      			__u32      gf_fmode;             /*   132     4 */
      			__u32      gf_numsrc;            /*   136     4 */
      			struct __kernel_sockaddr_storage gf_slist_flex[0] __attribute__((__aligned__(4))); /*   140     0 */
      		} __attribute__((__packed__)) __attribute__((__aligned__(4)));                     /*     0   140 */
      	} __attribute__((__aligned__(1)));               /*     0   268 */
      
      	/* size: 268, cachelines: 5, members: 1 */
      	/* forced alignments: 1 */
      	/* last cacheline: 12 bytes */
      } __attribute__((__packed__));
      
      This helps with the ongoing efforts to globally enable -Warray-bounds
      and get us closer to being able to tighten the FORTIFY_SOURCE routines
      on memcpy().
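      The general pattern being applied, in a simplified form (placeholder
      structs, not the real uapi definitions shown by the pahole output above):

      #include <linux/socket.h>
      #include <linux/types.h>

      /* Deprecated: trailing one-element array; sizeof() includes one bogus
       * element and the compiler cannot see the real extent of the data.
       */
      struct old_filter {
              __u32 count;
              struct __kernel_sockaddr_storage entries[1];
      };

      /* Preferred: flexible array member; sizeof() covers only the header,
       * and -Warray-bounds/FORTIFY_SOURCE can check accesses against the
       * size used at allocation time.
       */
      struct new_filter {
              __u32 count;
              struct __kernel_sockaddr_storage entries[];
      };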
      
      [1] https://en.wikipedia.org/wiki/Flexible_array_member
      [2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
      
      Link: https://github.com/KSPP/linux/issues/79
      Link: https://github.com/KSPP/linux/issues/109
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db243b79
  6. 04 Aug 2021, 7 commits
    • sock: allow reading and changing sk_userlocks with setsockopt · 04190bf8
      Pavel Tikhomirov committed
      The SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable the automatic
      socket buffer adjustment done by the kernel (see tcp_fixup_rcvbuf()
      and tcp_sndbuf_expand()). If we've just created a new socket this
      adjustment is enabled on it, but if one changes the socket buffer size
      with setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.

      CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
      restore: first it needs to increase the buffer sizes to restore the
      packet queues, and second it needs to restore the original buffer
      sizes. So after a CRIU restore all sockets become non-auto-adjustable,
      which can decrease the network performance of restored applications
      significantly.

      CRIU needs to be able to restore sockets with adjustment
      enabled/disabled to the same state they were in before the dump, so
      let's add a special setsockopt for it.

      Let's also export the SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to
      uAPI so that, using this interface, one can re-enable automatic socket
      buffer adjustment on their sockets.
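      A hedged user-space sketch of what using such a setsockopt could look
      like; the SO_BUF_LOCK name and its value are assumptions here, and only
      the SOCK_SNDBUF_LOCK/SOCK_RCVBUF_LOCK flag names come from the text.

      #include <stdio.h>
      #include <sys/socket.h>

      #ifndef SO_BUF_LOCK
      #define SO_BUF_LOCK 72                  /* assumed option name/number */
      #endif

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              int locks = 0;  /* clear SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK */
              socklen_t len = sizeof(locks);

              /* Re-enable automatic buffer adjustment, e.g. after restore. */
              if (setsockopt(fd, SOL_SOCKET, SO_BUF_LOCK, &locks, sizeof(locks)))
                      perror("setsockopt(SO_BUF_LOCK)");

              /* Read back the current lock state. */
              if (!getsockopt(fd, SOL_SOCKET, SO_BUF_LOCK, &locks, &len))
                      printf("buffer lock flags: %#x\n", locks);
              return 0;
      }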
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04190bf8
    • net: make switchdev_bridge_port_{,un}offload loosely coupled with the bridge · 957e2235
      Vladimir Oltean committed
      With the introduction of explicit offloading API in switchdev in commit
      2f5dc00f ("net: bridge: switchdev: let drivers inform which bridge
      ports are offloaded"), we started having Ethernet switch drivers calling
      directly into a function exported by net/bridge/br_switchdev.c, which is
      a function exported by the bridge driver.
      
      This means that drivers that did not have an explicit dependency on the
      bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
      possible to call a symbol exported by a driver that can be built as
      module unless you are a module too.
      
      There was an attempt to solve the dependency issue in the form of commit
      b0e81817 ("net: build all switchdev drivers as modules when the
      bridge is a module"). Grygorii Strashko, however, says about it:
      
      | In my opinion, the problem is a bit bigger here than just fixing the
      | build :(
      |
      | In case, of ^cpsw the switchdev mode is kinda optional and in many
      | cases (especially for testing purposes, NFS) the multi-mac mode is
      | still preferable mode.
      |
      | There were no such tight dependency between switchdev drivers and
      | bridge core before and switchdev serviced as independent, notification
      | based layer between them, so ^cpsw still can be "Y" and bridge can be
      | "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
      | will need to be set as "Y", or we will have to update drivers to
      | support build with BRIDGE=n and maintain separate builds for
      | networking vs non-networking testing.  But is this enough?  Wouldn't
      | it cause 'chain reaction' required to add more and more "Y" options
      | (like CONFIG_VLAN_8021Q)?
      |
      | PS. Just to be sure we on the same page - ARM builds will be forced
      | (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
      | automation testing will just fail with omap2plus_defconfig.
      
      In the light of this, it would be desirable for some configurations to
      avoid dependencies between switchdev drivers and the bridge, and have
      the switchdev mode as completely optional within the driver.
      
      Arnd Bergmann also tried to write a patch which better expressed the
      build time dependency for Ethernet switch drivers where the switchdev
      support is optional, like cpsw/am65-cpsw, and this made the drivers
      follow the bridge (compile as module if the bridge is a module) only if
      the optional switchdev support in the driver was enabled in the first
      place:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/
      
      but this still did not solve the fact that cpsw and am65-cpsw now must
      be built as modules when the bridge is a module - it just expressed
      correctly that optional dependency. But the new behavior is an apparent
      regression from Grygorii's perspective.
      
      So to support the use case where the Ethernet driver is built-in,
      NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
      need a framework that can handle the possible absence of the bridge from
      the running system, i.e. runtime bloatware as opposed to build-time
      bloatware.
      
      Luckily we already have this framework, since switchdev has been using
      it extensively. Events from the bridge side are transmitted to the
      driver side using notifier chains - this was originally done so that
      unrelated drivers could snoop for events emitted by the bridge towards
      ports that are implemented by other drivers (think of a switch driver
      with LAG offload that listens for switchdev events on a bonding/team
      interface that it offloads).
      
      There are also events which are transmitted from the driver side to the
      bridge side, which again are modeled using notifiers.
      SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
      notifying the bridge that a MAC address has been dynamically learned.
      So there is a precedent we can use for modeling the new framework.
      
      The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
      that the bridge needs to do when a port becomes offloaded is blocking in
      its nature: replay VLANs, MDBs etc. The calling context is indeed
      blocking (we are under rtnl_mutex), but the existing switchdev
      notification chain that the bridge is subscribed to is only the atomic
      one. So we need to subscribe the bridge to the blocking switchdev
      notification chain too.
      
      This patch:
      - keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
        unchanged
      - moves the implementation of switchdev_bridge_port_{,un}offload from
        the bridge module into the switchdev module.
      - makes everybody that is subscribed to the switchdev blocking notifier
        chain "hear" offload & unoffload events
      - makes the bridge driver subscribe and handle those events
      - moves the bridge driver's handling of those events into 2 new
        functions called br_switchdev_port_{,un}offload. These functions
        contain in fact the core of the logic that was previously in
        switchdev_bridge_port_{,un}offload, just that now we go through an
        extra indirection layer to reach them.
      
      Unlike all the other switchdev notification structures, the structure
      used to carry the bridge port information, struct
      switchdev_notifier_brport_info, does not contain a "bool handled".
      This is because in the current usage pattern, we always know that a
      switchdev bridge port offloading event will be handled by the bridge,
      because the switchdev_bridge_port_offload() call was initiated by a
      NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
      bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
      couldn't have happened.
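      A hedged sketch of the resulting flow on the bridge side: subscribe to
      the blocking switchdev notifier chain and handle the (un)offload events
      that switchdev_bridge_port_{,un}offload() now publish. The event names
      used below are assumptions based on the description.

      #include <linux/notifier.h>
      #include <net/switchdev.h>

      static int br_switchdev_blocking_event(struct notifier_block *nb,
                                             unsigned long event, void *ptr)
      {
              /* ptr carries struct switchdev_notifier_brport_info; note it has
               * no "bool handled" field, for the reason explained above.
               */
              switch (event) {
              case SWITCHDEV_BRPORT_OFFLOADED:        /* assumed event name */
                      /* replay VLANs, MDBs, etc. towards the driver - the old
                       * body of switchdev_bridge_port_offload(); this may
                       * sleep, hence the blocking chain.
                       */
                      return NOTIFY_OK;
              case SWITCHDEV_BRPORT_UNOFFLOADED:      /* assumed event name */
                      return NOTIFY_OK;
              }
              return NOTIFY_DONE;
      }

      static struct notifier_block br_switchdev_blocking_nb = {
              .notifier_call = br_switchdev_blocking_event,
      };

      static int br_switchdev_notifier_init(void)
      {
              return register_switchdev_blocking_notifier(&br_switchdev_blocking_nb);
      }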
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      957e2235
    • can: j1939: extend UAPI to notify about RX status · 5b9272e9
      Oleksij Rempel committed
      To be able to create applications with user-friendly feedback, we
      need to be able to provide receive status information.

      A typical ETP transfer may take seconds or even hours. To give the
      user some clue or show a progress bar, the stack should push status
      updates. As with the TX information, the socket error queue will be
      used, with the following new signals:
      - J1939_EE_INFO_RX_RTS   - received and accepted a request to send signal
      - J1939_EE_INFO_RX_DPO   - received a data package offset signal
      - J1939_EE_INFO_RX_ABORT - RX session was aborted

      Instead of a completion signal, the user will get the data package.
      To activate these signals, the application should set
      SOF_TIMESTAMPING_RX_SOFTWARE in the SO_TIMESTAMPING socket option.
      This avoids unpredictable application behavior for old software.
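      A hedged user-space sketch of opting in and draining the new
      notifications; the SO_TIMESTAMPING/MSG_ERRQUEUE plumbing is the
      standard pattern, while the exact encoding of the J1939_EE_INFO_RX_*
      values inside struct sock_extended_err is not shown and would need to
      be checked against the final UAPI.

      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <linux/net_tstamp.h>

      static int enable_rx_status(int sk)
      {
              int flags = SOF_TIMESTAMPING_RX_SOFTWARE;

              /* Old binaries that never set this flag keep the old behavior. */
              return setsockopt(sk, SOL_SOCKET, SO_TIMESTAMPING,
                                &flags, sizeof(flags));
      }

      static void drain_rx_status(int sk)
      {
              char data[64], cbuf[256];
              struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
              struct msghdr msg = {
                      .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
              };

              /* J1939_EE_INFO_RX_{RTS,DPO,ABORT} arrive on the error queue,
               * the same channel already used for TX notifications; walk the
               * cmsgs for struct sock_extended_err to read the state.
               */
              while (recvmsg(sk, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0)
                      ;
      }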
      
      Link: https://lore.kernel.org/r/20210707094854.30781-3-o.rempel@pengutronix.de
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      5b9272e9
    • net: add netif_set_real_num_queues() for device reconfig · 271e5b7d
      Jakub Kicinski committed
      netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
      can fail which breaks drivers trying to implement reconfiguration
      in a way that can't leave the device half-broken. In other words,
      those functions are incompatible with a prepare/commit approach.

      Luckily, setting the real number of queues can fail only if the number
      is increased, meaning that if we order operations correctly we
      can guarantee ending up with either the new config (success) or
      the old one (on error).
      
      Provide a helper implementing such logic so that drivers don't
      have to duplicate it.
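      A hedged sketch of the ordering idea behind the helper (fallible
      increases first so they can be undone, infallible decreases last); not
      necessarily the exact upstream implementation.

      #include <linux/netdevice.h>

      static int set_real_num_queues(struct net_device *dev,
                                     unsigned int txq, unsigned int rxq)
      {
              unsigned int old_rxq = dev->real_num_rx_queues;
              int err;

              /* Increases can fail: do them first and roll back on error. */
              if (rxq > dev->real_num_rx_queues) {
                      err = netif_set_real_num_rx_queues(dev, rxq);
                      if (err)
                              return err;
              }
              if (txq > dev->real_num_tx_queues) {
                      err = netif_set_real_num_tx_queues(dev, txq);
                      if (err)
                              goto undo_rx;
              }

              /* Decreases cannot fail, so the new config can no longer be lost. */
              if (rxq < dev->real_num_rx_queues)
                      netif_set_real_num_rx_queues(dev, rxq);
              if (txq < dev->real_num_tx_queues)
                      netif_set_real_num_tx_queues(dev, txq);
              return 0;

      undo_rx:
              netif_set_real_num_rx_queues(dev, old_rxq);
              return err;
      }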
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      271e5b7d
    • net: add extack arg for link ops · 8679c31e
      Rocco Yue committed
      Pass extack arg to validate_linkmsg and validate_link_af callbacks.
      If a netlink attribute has a reject_message, use the extended ack
      mechanism to carry the message back to user space.
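      A hedged sketch of the callback shape after the change (the exact
      prototype is assumed from the description):

      #include <linux/netlink.h>
      #include <net/rtnetlink.h>

      static int my_validate_link_af(const struct net_device *dev,
                                     const struct nlattr *attr,
                                     struct netlink_ext_ack *extack)
      {
              if (!dev) {
                      /* The message travels back to user space as an
                       * extended ack instead of a bare errno.
                       */
                      NL_SET_ERR_MSG(extack, "cannot validate without a device");
                      return -EOPNOTSUPP;
              }
              return 0;
      }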
      Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8679c31e
    • af_unix: Add OOB support · 314001f0
      Rao Shoaib committed
      This patch adds OOB support for AF_UNIX sockets.
      The semantics are the same as for TCP.

      The last byte of a message sent with the OOB flag is
      treated as the OOB byte. The byte is separated into
      an skb, and a pointer to the skb is stored in unix_sock.
      The pointer is used to enforce OOB semantics.
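      A short user-space sketch of the TCP-like semantics described above,
      using only the standard socket API:

      #include <stdio.h>
      #include <sys/socket.h>

      int main(void)
      {
              int sv[2];
              char inline_buf[8] = { 0 }, oob = 0;

              if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
                      return 1;

              /* "ab" is queued in-band, 'c' becomes the out-of-band byte. */
              send(sv[0], "abc", 3, MSG_OOB);

              recv(sv[1], inline_buf, sizeof(inline_buf), 0); /* reads "ab" */
              recv(sv[1], &oob, 1, MSG_OOB);                  /* reads 'c'  */
              printf("inline=%s oob=%c\n", inline_buf, oob);
              return 0;
      }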
      Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      314001f0
    • bus: fsl-mc: extend fsl_mc_get_endpoint() to pass interface ID · 27cfdadd
      Ioana Ciornei committed
      In the case of a switch DPAA2 object, the interface ID is also needed
      when querying for the object endpoint. Extend fsl_mc_get_endpoint() so
      that users can also pass the interface ID they are interested in.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27cfdadd
  7. 03 Aug 2021, 12 commits
  8. 02 Aug 2021, 6 commits