1. 03 Mar, 2022 4 commits
    • net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill() · 05415bcc
      Committed by Petr Machata
      Later patches add handlers for more HW-backed statistics. An extack will be
      useful when communicating HW / driver errors to the client. Add the
      arguments as appropriate.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests · 46efc97b
      Committed by Petr Machata
      The filter_mask field of RTM_GETSTATS header determines which top-level
      attributes should be included in the netlink response. This saves
      processing time by only including the bits that the user cares about
      instead of always dumping everything. This is doubly important for
      HW-backed statistics that would typically require a trip to the device to
      fetch the stats.
      
      So far there was only one HW-backed stat suite per attribute. However,
      IFLA_STATS_LINK_OFFLOAD_XSTATS is a nest, and will gain a new stat suite in
      the following patches. It would therefore be advantageous to be able to
      filter within that nest, and select just one or the other HW-backed
      statistics suite.
      
      Extend rtnetlink so that RTM_GETSTATS permits attributes in the payload.
      The scheme is as follows:
      
          - RTM_GETSTATS
              - struct if_stats_msg
              - attr nest IFLA_STATS_GET_FILTERS
                  - attr IFLA_STATS_LINK_OFFLOAD_XSTATS
                      - u32 filter_mask
      
      This scheme reuses the existing enumerators by nesting them in a dedicated
      context attribute. This is covered by policies as usual, therefore a
      gradual opt-in is possible. Currently only IFLA_STATS_LINK_OFFLOAD_XSTATS
      nest has filtering enabled, because for the SW counters the issue does not
      seem to be that important.
      
      rtnl_offload_xstats_get_size() and _fill() are extended to observe the
      requested filters.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
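The per-nest filtering described in the commit above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the enumerators, the `FILTER_BIT()` macro, the suite sizes, and the helper names are hypothetical stand-ins chosen for illustration, not the kernel's definitions or the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the attributes inside the nest; the names
 * and values are illustrative, not copied from the kernel headers. */
enum xstats_attr {
	XSTATS_UNSPEC,
	XSTATS_CPU_HIT,		/* models one statistics suite */
	XSTATS_HW_SUITE,	/* models a second, HW-backed suite */
	XSTATS_ATTR_COUNT,
};

/* Assumed convention: attribute N is selected by bit N - 1. */
#define FILTER_BIT(attr) (1u << ((attr) - 1))

/* True if the caller asked for this suite in the nested filter mask. */
static bool suite_requested(uint32_t filter_mask, int attr)
{
	assert(attr > XSTATS_UNSPEC && attr < XSTATS_ATTR_COUNT);
	return (filter_mask & FILTER_BIT(attr)) != 0;
}

/* Model of a get_size-like helper: only requested suites add to the
 * reply size, so unrequested HW suites are never queried. */
static size_t xstats_get_size(uint32_t filter_mask)
{
	static const size_t suite_size[XSTATS_ATTR_COUNT] = {
		[XSTATS_CPU_HIT] = 64,		/* made-up payload sizes */
		[XSTATS_HW_SUITE] = 128,
	};
	size_t size = 0;

	for (int attr = XSTATS_CPU_HIT; attr < XSTATS_ATTR_COUNT; attr++)
		if (suite_requested(filter_mask, attr))
			size += suite_size[attr];
	return size;
}
```

Calling `xstats_get_size(FILTER_BIT(XSTATS_HW_SUITE))` counts only the second suite, which mirrors the goal of selecting one HW-backed suite without paying for the other.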
    • net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed · f6e0fb81
      Committed by Petr Machata
      The IFLA_STATS_LINK_OFFLOAD_XSTATS attribute is a nest whose child
      attributes carry various special hardware statistics. The code that handles
      this nest was written with the idea that all these statistics would be
      exposed by the device driver of a physical netdevice.
      
      In the following patches, a new attribute is added to the abovementioned
      nest, which however can be defined for some soft netdevices. The NDO-based
      approach to querying these does not work, because it is not the soft
      netdevice driver that exposes these statistics, but an offloading NIC
      driver that does so.
      
      The current code does not scale well to this usage. Simply rewrite it back
      to the pattern seen in other fill-like and get_size-like functions
      elsewhere.
      
      Extract to helpers the code that is concerned with handling specifically
      NDO-backed statistics so that it can be easily reused should more such
      statistics be added.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_* · 6b524a1d
      Committed by Petr Machata
      The currently used names rtnl_get_offload_stats() and
      rtnl_get_offload_stats_size() do not clearly show the namespace. The former
      function additionally seems to have been named this way in accordance with
      the NDO name, as opposed to the naming used in the rtnetlink.c file (and
      indeed elsewhere in the netlink handling code). As more and
      differently-flavored attributes are introduced, a common clear prefix is
      needed for all related functions.
      
      Rename the functions to follow the rtnl_offload_xstats_* naming scheme.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 17 Feb, 2022 1 commit
  3. 14 Feb, 2022 1 commit
    • net_sched: add __rcu annotation to netdev->qdisc · 5891cd5e
      Committed by Eric Dumazet
      syzbot found a data-race [1] which led me to add __rcu
      annotations to netdev->qdisc, and proper accessors
      to get LOCKDEP support.
      
      [1]
      BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu
      
      write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
       attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
       dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
       __dev_open+0x2e9/0x3a0 net/core/dev.c:1416
       __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
       rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
       __rtnl_newlink net/core/rtnetlink.c:3489 [inline]
       rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
       rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
       qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
       __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
       tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
       rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 470502de ("net: sched: unlock rules update API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 Feb, 2022 1 commit
  5. 02 Feb, 2022 1 commit
  6. 06 Jan, 2022 1 commit
  7. 29 Nov, 2021 1 commit
    • net: Write lock dev_base_lock without disabling bottom halves. · fd888e85
      Committed by Sebastian Andrzej Siewior
      The writer acquires dev_base_lock with disabled bottom halves.
      The reader can acquire dev_base_lock without disabling bottom halves
      because there is no writer in softirq context.
      
      On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
      as a lock to ensure that resources, that are protected by disabling
      bottom halves, remain protected.
      This leads to a circular locking dependency if the lock is acquired with
      disabled bottom halves (as in write_lock_bh()) and somewhere else with
      enabled bottom halves (as by read_lock() in netstat_show()) followed by
      disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
      -> spin_lock_bh()). This is the reverse locking order.
      
      All read_lock() invocations are from sysfs callbacks, which are not invoked
      from softirq context. Therefore there is no need to disable bottom
      halves while acquiring a write lock.
      
      Acquire the write lock of dev_base_lock without disabling bottom halves.
      Reported-by: Pei Zhang <pezhang@redhat.com>
      Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 23 Nov, 2021 1 commit
    • net: remove .ndo_change_proto_down · 2106efda
      Committed by Jakub Kicinski
      .ndo_change_proto_down was added seemingly to enable out-of-tree
      implementations. Over 2.5yrs later we still have no real users
      upstream. Hardwire the generic implementation for now, we can
      revert once real users materialize. (rocker is a test vehicle,
      not a user.)
      
      We need to drop the optimization on the sysfs side, because
      unlike ndos priv_flags will be changed at runtime, so we'd
      need READ_ONCE/WRITE_ONCE everywhere.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 22 Nov, 2021 1 commit
  10. 24 Oct, 2021 1 commit
  11. 21 Oct, 2021 1 commit
  12. 16 Oct, 2021 1 commit
  13. 06 Oct, 2021 1 commit
  14. 19 Sep, 2021 1 commit
  15. 26 Aug, 2021 1 commit
    • rtnetlink: Return correct error on changing device netns · 96a6b93b
      Committed by Andrey Ignatov
      Currently, when a device is moved between network namespaces using the
      RTM_NEWLINK message type and one of the netns attributes (IFLA_NET_NS_PID,
      IFLA_NET_NS_FD, IFLA_TARGET_NETNSID) but without specifying IFLA_IFNAME,
      and the target namespace already has a device with the same name,
      userspace gets EINVAL, which is confusing and makes debugging harder.
      
      Fix it so that userspace gets the more appropriate EEXIST instead, which
      makes debugging much easier.
      
      Before:
      
        # ./ifname.sh
        + ip netns add ns0
        + ip netns exec ns0 ip link add l0 type dummy
        + ip netns exec ns0 ip link show l0
        8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether 66:90:b5:d5:78:69 brd ff:ff:ff:ff:ff:ff
        + ip link add l0 type dummy
        + ip link show l0
        10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether 6e:c6:1f:15:20:8d brd ff:ff:ff:ff:ff:ff
        + ip link set l0 netns ns0
        RTNETLINK answers: Invalid argument
      
      After:
      
        # ./ifname.sh
        + ip netns add ns0
        + ip netns exec ns0 ip link add l0 type dummy
        + ip netns exec ns0 ip link show l0
        8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether 1e:4a:72:e3:e3:8f brd ff:ff:ff:ff:ff:ff
        + ip link add l0 type dummy
        + ip link show l0
        10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether f2:fc:fe:2b:7d:a6 brd ff:ff:ff:ff:ff:ff
        + ip link set l0 netns ns0
        RTNETLINK answers: File exists
      
      The problem is that do_setlink() passes its `char *ifname` argument,
      that it gets from a caller, to __dev_change_net_namespace() as is (as
      `const char *pat`), but semantics of ifname and pat can be different.
      
      For example, __rtnl_newlink() does this:
      
      net/core/rtnetlink.c
          3270	char ifname[IFNAMSIZ];
           ...
          3286	if (tb[IFLA_IFNAME])
          3287		nla_strscpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
          3288	else
          3289		ifname[0] = '\0';
           ...
          3364	if (dev) {
           ...
          3394		return do_setlink(skb, dev, ifm, extack, tb, ifname, status);
          3395	}
      
      That is, do_setlink() gets an ifname pointer that is always valid, whether
      or not the user specified IFLA_IFNAME, and then passes this pointer as is
      to __dev_change_net_namespace() as the pat argument.
      
      But the pat (pattern) in __dev_change_net_namespace() is used as:
      
      net/core/dev.c
         11198	err = -EEXIST;
         11199	if (__dev_get_by_name(net, dev->name)) {
         11200		/* We get here if we can't use the current device name */
         11201		if (!pat)
         11202			goto out;
         11203		err = dev_get_valid_name(net, dev, pat);
         11204		if (err < 0)
         11205			goto out;
         11206	}
      
      As a result, the `goto out` path on line 11202 is never taken and,
      instead of returning the EEXIST set on line 11198,
      __dev_change_net_namespace() returns the error from dev_get_valid_name(),
      which in turn is EINVAL for the ifname[0] = '\0' set earlier.
      
      Fixes: d8a5ec67 ("[NET]: netlink support for moving devices between network namespaces.")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
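The difference between a NULL pattern and an empty one, as analyzed in the commit above, can be reproduced in a small userspace model. This is a sketch under assumptions: `get_valid_name()`, `change_netns()`, and `pat_from_ifname()` are hypothetical simplifications written for illustration, and the normalization helper shows one possible way to handle the case rather than the patch's actual diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for dev_get_valid_name(): an empty pattern is
 * not a usable interface name, so it is rejected with -EINVAL. */
static int get_valid_name(const char *pat)
{
	assert(pat != NULL);
	return pat[0] == '\0' ? -EINVAL : 0;
}

/* Simplified model of the name check quoted from net/core/dev.c.
 * name_taken says whether the target namespace already has a device
 * with the current name. */
static int change_netns(bool name_taken, const char *pat)
{
	int err = -EEXIST;

	if (name_taken) {
		/* No fallback pattern: the name clash is reported as is. */
		if (!pat)
			return err;
		err = get_valid_name(pat);
		if (err < 0)
			return err;
	}
	return 0;
}

/* Caller-side idea: treat "no name given" (empty buffer) as NULL so
 * the -EEXIST path above becomes reachable. */
static const char *pat_from_ifname(const char *ifname)
{
	return (ifname && ifname[0] != '\0') ? ifname : NULL;
}
```

With a name clash, `change_netns(true, "")` yields `-EINVAL`, while `change_netns(true, pat_from_ifname(""))` yields `-EEXIST`, matching the "Before" and "After" outputs in the commit message.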
  16. 11 Aug, 2021 1 commit
  17. 04 Aug, 2021 1 commit
  18. 27 Jul, 2021 1 commit
  19. 17 Jul, 2021 1 commit
  20. 30 Jun, 2021 2 commits
  21. 13 Jun, 2021 3 commits
  22. 10 Jun, 2021 1 commit
    • rtnetlink: Fix regression in bridge VLAN configuration · d2e381c4
      Committed by Ido Schimmel
      Cited commit started returning errors when notification info is not
      filled by the bridge driver, resulting in the following regression:
      
       # ip link add name br1 type bridge vlan_filtering 1
       # bridge vlan add dev br1 vid 555 self pvid untagged
       RTNETLINK answers: Invalid argument
      
      As long as the bridge driver does not fill notification info for the
      bridge device itself, an empty notification should not be considered as
      an error. This is explained in commit 59ccaaaa ("bridge: dont send
      notification when skb->len == 0 in rtnl_bridge_notify").
      
      Fix by removing the error and adding a comment to avoid future bugs.
      
      Fixes: a8db57c1 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 04 Jun, 2021 2 commits
  24. 11 May, 2021 1 commit
    • rtnetlink: avoid RCU read lock when holding RTNL · a100243d
      Committed by Cong Wang
      When we call af_ops->set_link_af() we hold an RCU read lock
      as we retrieve af_ops from the RCU protected list, but this
      is unnecessary because we already hold the RTNL lock, which is
      the writer lock for protecting rtnl_af_ops, so it is safer
      than the RCU read lock. The same applies to af_ops->validate_link_af().
      
      This was not a problem until we recently began to take a mutex
      down the path of ->set_link_af() in __ipv6_dev_mc_dec().
      We can just drop the RCU read lock there and
      assert the RTNL lock.
      
      Reported-and-tested-by: syzbot+7d941e89dd48bcf42573@syzkaller.appspotmail.com
      Fixes: 63ed8de4 ("mld: add mc_lock for protecting per-interface mld data")
      Tested-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 09 Apr, 2021 1 commit
  26. 08 Apr, 2021 2 commits
  27. 06 Apr, 2021 1 commit
  28. 04 Mar, 2021 1 commit
  29. 12 Feb, 2021 1 commit
    • net: fix dev_ifsioc_locked() race condition · 3b23a32a
      Committed by Cong Wang
      dev_ifsioc_locked() is called with only RCU read lock, so when
      there is a parallel writer changing the mac address, it could
      get a partially updated mac address, as shown below:
      
      Thread 1			Thread 2
      // eth_commit_mac_addr_change()
      memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
      				// dev_ifsioc_locked()
      				memcpy(ifr->ifr_hwaddr.sa_data,
      					dev->dev_addr,...);
      
      Close this race condition by guarding them with a RW semaphore,
      like netdev_get_name(). We can not use seqlock here as it does not
      allow blocking. The writers already take RTNL anyway, so this does
      not affect the slow path. To avoid bothering existing
      dev_set_mac_address() callers in drivers, introduce a new wrapper
      just for user-facing callers on ioctl and rtnetlink paths.
      
      Note, bonding also changes slave mac addresses but that requires
      a separate patch due to the complexity of bonding code.
      
      Fixes: 3710becf ("net: RCU locking for simple ioctl()")
      Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 28 Jan, 2021 1 commit
    • net: bridge: multicast: make tracked EHT hosts limit configurable · 2dba407f
      Committed by Nikolay Aleksandrov
      Add two new port attributes which make EHT hosts limit configurable and
      export the current number of tracked EHT hosts:
       - IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT: configure/retrieve current limit
       - IFLA_BRPORT_MCAST_EHT_HOSTS_CNT: current number of tracked hosts
      Setting IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT to 0 is currently not allowed.
      
      Note that we have to increase RTNL_SLAVE_MAX_TYPE to 38 minimum, I've
      increased it to 40 to have space for two more future entries.
      
      v2: move br_multicast_eht_set_hosts_limit() to br_multicast_eht.c,
          no functional change
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  31. 09 Jan, 2021 2 commits
    • net: make free_netdev() more lenient with unregistering devices · c269a24c
      Committed by Jakub Kicinski
      There are two flavors of handling netdev registration:
       - ones called without holding rtnl_lock: register_netdev() and
         unregister_netdev(); and
       - those called with rtnl_lock held: register_netdevice() and
         unregister_netdevice().
      
      While the semantics of the former are pretty clear, the same can't
      be said about the latter. The netdev_todo mechanism is utilized to
      perform some of the device unregistering tasks and it hooks into
      rtnl_unlock() so the locked variants can't actually finish the work.
      In general free_netdev() does not mix well with locked calls. Most
      drivers operating under rtnl_lock set dev->needs_free_netdev to true
      and expect core to make the free_netdev() call some time later.
      
      The part where this becomes most problematic is error paths. There is
      no way to unwind the state cleanly after a call to register_netdevice(),
      since unreg can't be performed fully without dropping locks.
      
      Make free_netdev() more lenient, and defer the freeing if device
      is being unregistered. This allows error paths to simply call
      free_netdev() both after register_netdevice() failed, and after
      a call to unregister_netdevice() but before dropping rtnl_lock.
      
      Simplify the error paths which are currently doing gymnastics
      around free_netdev() handling.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
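The "defer the freeing if the device is being unregistered" behaviour described in the commit above can be modelled in a few lines of userspace C. This is a sketch under assumptions: `struct fake_netdev`, its state names, and both helpers are hypothetical and only imitate the described flow; they are not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical registration states, loosely modelled on the commit text. */
enum reg_state { UNINITIALIZED, REGISTERED, UNREGISTERING, RELEASED };

struct fake_netdev {
	enum reg_state state;
	bool needs_free;	/* a free was requested but had to be deferred */
};

/* Model of the lenient free: while unregistration is still in flight,
 * only record the request. Returns true if the memory was freed now. */
static bool fake_free_netdev(struct fake_netdev *dev)
{
	if (dev->state == UNREGISTERING) {
		dev->needs_free = true;
		return false;
	}
	free(dev);
	return true;
}

/* Model of the deferred work that runs once the lock is dropped: finish
 * unregistration, then honour a deferred free request. Returns true if
 * the memory was freed here. */
static bool fake_todo_run(struct fake_netdev *dev)
{
	assert(dev->state == UNREGISTERING);
	dev->state = RELEASED;
	if (dev->needs_free) {
		free(dev);
		return true;
	}
	return false;
}
```

In this model an error path can call `fake_free_netdev()` right after starting unregistration without tracking whether the lock has been dropped yet; the actual release happens later in `fake_todo_run()`.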
    • docs: net: explain struct net_device lifetime · 2b446e65
      Committed by Jakub Kicinski
      Explain the two basic flows of struct net_device's operation.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>