1. 03 4月, 2023 1 次提交
    • V
      net: create a netdev notifier for DSA to reject PTP on DSA master · 88c0a6b5
      Vladimir Oltean 提交于
      The fact that PTP 2-step TX timestamping is broken on DSA switches if
      the master also timestamps the same packets is documented by commit
      f685e609 ("net: dsa: Deny PTP on master if switch supports it").
      We attempt to help the users avoid shooting themselves in the foot by
      making DSA reject the timestamping ioctls on an interface that is a DSA
      master, and the switch tree beneath it contains switches which are aware
      of PTP.
      
      The only problem is that there isn't an established way of intercepting
      ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
      stack by creating a struct dsa_netdevice_ops with overlaid function
      pointers that are manually checked from the relevant call sites. There
      used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
      one left.
      
      There is an ongoing effort to migrate driver-visible hardware timestamping
      control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
      model, but DSA actively prevents that migration, since dsa_master_ioctl()
      is currently coded to manually call the master's legacy ndo_eth_ioctl(),
      and so, whenever a network device driver would be converted to the new
      API, DSA's restrictions would be circumvented, because any device could
      be used as a DSA master.
      
      The established way for unrelated modules to react on a net device event
      is via netdevice notifiers. So we create a new notifier which gets
      called whenever there is an attempt to change hardware timestamping
      settings on a device.
      
      Finally, there is another reason why a netdev notifier will be a good
      idea, besides strictly DSA, and this has to do with PHY timestamping.
      
      With ndo_eth_ioctl(), all MAC drivers must manually call
      phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
      otherwise they must pass this ioctl to the PHY driver via
      phy_mii_ioctl().
      
      With the new ndo_hwtstamp_set() API, it will be desirable to simply not
      make any calls into the MAC device driver when timestamping should be
      performed at the PHY level.
      
      But there exist drivers, such as the lan966x switch, which need to
      install packet traps for PTP regardless of whether they are the layer
      that provides the hardware timestamps, or the PHY is. That would be
      impossible to support with the new API.
      
      The proposal there, too, is to introduce a netdev notifier which acts as
      a better cue for switching drivers to add or remove PTP packet traps,
      than ndo_hwtstamp_set(). The one introduced here "almost" works there as
      well, except for the fact that packet traps should only be installed if
      the PHY driver succeeded to enable hardware timestamping, whereas here,
      we need to deny hardware timestamping on the DSA master before it
      actually gets enabled. This is why this notifier is called "PRE_", and
      the notifier that would get used for PHY timestamping and packet traps
      would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
      example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.
      
      In expectation of future netlink UAPI, we also pass a non-NULL extack
      pointer to the netdev notifier, and we make DSA populate it with an
      informative reason for the rejection. To avoid making it go to waste, we
      make the ioctl-based dev_set_hwtstamp() create a fake extack and print
      the message to the kernel log.
      
      Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
      Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88c0a6b5
  2. 30 3月, 2023 4 次提交
  3. 23 3月, 2023 2 次提交
  4. 08 3月, 2023 1 次提交
  5. 24 2月, 2023 1 次提交
  6. 20 2月, 2023 1 次提交
    • E
      net: add location to trace_consume_skb() · dd1b5278
      Eric Dumazet 提交于
      kfree_skb() includes the location, it makes sense
      to add it to consume_skb() as well.
      
      After patch:
      
       taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
               swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
            discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
               swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
               borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
               swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
                  wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd1b5278
  7. 16 2月, 2023 3 次提交
    • I
      devlink: Fix netdev notifier chain corruption · b20b8aec
      Ido Schimmel 提交于
      Cited commit changed devlink to register its netdev notifier block on
      the global netdev notifier chain instead of on the per network namespace
      one.
      
      However, when changing the network namespace of the devlink instance,
      devlink still tries to unregister its notifier block from the chain of
      the old namespace and register it on the chain of the new namespace.
      This results in corruption of the notifier chains, as the same notifier
      block is registered on two different chains: The global one and the per
      network namespace one. In turn, this causes other problems such as the
      inability to dismantle namespaces due to netdev reference count issues.
      
      Fix by preventing devlink from moving its notifier block between
      namespaces.
      
      Reproducer:
      
       # echo "10 1" > /sys/bus/netdevsim/new_device
       # ip netns add test123
       # devlink dev reload netdevsim/netdevsim10 netns test123
       # ip netns del test123
       [   71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [   71.938348] leaked reference.
      
      Fixes: 565b4824 ("devlink: change port event netdev notifier from per-net to global")
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: NJakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.comSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
      b20b8aec
    • J
      net/core: refactor promiscuous mode message · 3ba0bf47
      Jesse Brandeburg 提交于
      The kernel stack can be more consistent by printing the IFF_PROMISC
      aka promiscuous enable/disable messages with the standard netdev_info
      message which can include bus and driver info as well as the device.
      
      typical command usage from user space looks like:
      ip link set eth0 promisc <on|off>
      
      But lots of utilities such as bridge, tcpdump, etc put the interface into
      promiscuous mode.
      
      old message:
      [  406.034418] device eth0 entered promiscuous mode
      [  408.424703] device eth0 left promiscuous mode
      
      new message:
      [  406.034431] ice 0000:17:00.0 eth0: entered promiscuous mode
      [  408.424715] ice 0000:17:00.0 eth0: left promiscuous mode
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      3ba0bf47
    • J
      net/core: print message for allmulticast · 802dcbd6
      Jesse Brandeburg 提交于
      When the user sets or clears the IFF_ALLMULTI flag in the netdev, there are
      no log messages printed to the kernel log to indicate anything happened.
      This is inexplicably different from most other dev->flags changes, and
      could suprise the user.
      
      Typically this occurs from user-space when a user:
      ip link set eth0 allmulticast <on|off>
      
      However, other devices like bridge set allmulticast as well, and many
      other flows might trigger entry into allmulticast as well.
      
      The new message uses the standard netdev_info print and looks like:
      [  413.246110] ixgbe 0000:17:00.0 eth0: entered allmulticast mode
      [  415.977184] ixgbe 0000:17:00.0 eth0: left allmulticast mode
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      802dcbd6
  8. 13 2月, 2023 1 次提交
  9. 03 2月, 2023 1 次提交
  10. 02 2月, 2023 1 次提交
  11. 24 1月, 2023 3 次提交
  12. 19 12月, 2022 1 次提交
  13. 04 12月, 2022 1 次提交
  14. 18 11月, 2022 1 次提交
  15. 16 11月, 2022 4 次提交
  16. 10 11月, 2022 1 次提交
  17. 09 11月, 2022 1 次提交
    • A
      net/core: Allow live renaming when an interface is up · bd039b5e
      Andy Ren 提交于
      Allow a network interface to be renamed when the interface
      is up.
      
      As described in the netconsole documentation [1], when netconsole is
      used as a built-in, it will bring up the specified interface as soon as
      possible. As a result, user space will not be able to rename the
      interface since the kernel disallows renaming of interfaces that are
      administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
      by the kernel.
      
      The original solution [2] to this problem was to add a new parameter to
      the netconsole configuration parameters that allows renaming of
      the interface used by netconsole while it is administratively up.
      However, during the discussion that followed, it became apparent that we
      have no reason to keep the current restriction and instead we should
      allow user space to rename interfaces regardless of their administrative
      state:
      
      1. The restriction was put in place over 20 years ago when renaming was
      only possible via IOCTL and before rtnetlink started notifying user
      space about such changes like it does today.
      
      2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
      5.2 and no regressions were reported.
      
      3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
      the administrative state of interface.
      
      Therefore, allow user space to rename running interfaces by removing the
      restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
      possible triage by emitting a message to the kernel log that an
      interface was renamed while UP.
      
      [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
      [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/Signed-off-by: NAndy Ren <andy.ren@getcruise.com>
      Reviewed-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd039b5e
  18. 04 11月, 2022 1 次提交
    • J
      net: devlink: track netdev with devlink_port assigned · 02a68a47
      Jiri Pirko 提交于
      Currently, ethernet drivers are using devlink_port_type_eth_set() and
      devlink_port_type_clear() to set devlink port type and link to related
      netdev.
      
      Instead of calling them directly, let the driver use
      SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let
      devlink to track it. Note the devlink port pointer is static during
      the time netdevice is registered.
      
      In devlink code, use per-namespace netdev notifier to track
      the netdevices with devlink_port assigned and change the internal
      devlink_port type and related type pointer accordingly.
      Signed-off-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      02a68a47
  19. 01 11月, 2022 2 次提交
  20. 29 10月, 2022 1 次提交
  21. 26 10月, 2022 1 次提交
    • K
      net: dev: Convert sa_data to flexible array in struct sockaddr · b5f0de6d
      Kees Cook 提交于
      One of the worst offenders of "fake flexible arrays" is struct sockaddr,
      as it is the classic example of why GCC and Clang have been traditionally
      forced to treat all trailing arrays as fake flexible arrays: in the
      distant misty past, sa_data became too small, and code started just
      treating it as a flexible array, even though it was fixed-size. The
      special case by the compiler is specifically that sizeof(sa->sa_data)
      and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1))
      do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat
      it as a flexible array.
      
      However, the coming -fstrict-flex-arrays compiler flag will remove
      these special cases so that FORTIFY_SOURCE can gain coverage over all
      the trailing arrays in the kernel that are _not_ supposed to be treated
      as a flexible array. To deal with this change, convert sa_data to a true
      flexible array. To keep the structure size the same, move sa_data into
      a union with a newly introduced sa_data_min with the original size. The
      result is that FORTIFY_SOURCE can continue to have no idea how large
      sa_data may actually be, but anything using sizeof(sa->sa_data) must
      switch to sizeof(sa->sa_data_min).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Dylan Yudaken <dylany@fb.com>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Cc: Petr Machata <petrm@nvidia.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: syzbot <syzkaller@googlegroups.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      b5f0de6d
  22. 19 10月, 2022 1 次提交
    • P
      net: Fix return value of qdisc ingress handling on success · 672e97ef
      Paul Blakey 提交于
      Currently qdisc ingress handling (sch_handle_ingress()) doesn't
      set a return value and it is left to the old return value of
      the caller (__netif_receive_skb_core()) which is RX drop, so if
      the packet is consumed, caller will stop and return this value
      as if the packet was dropped.
      
      This causes a problem in the kernel tcp stack when having a
      egress tc rule forwarding to a ingress tc rule.
      The tcp stack sending packets on the device having the egress rule
      will see the packets as not successfully transmitted (although they
      actually were), will not advance it's internal state of sent data,
      and packets returning on such tcp stream will be dropped by the tcp
      stack with reason ack-of-unsent-data. See reproduction in [0] below.
      
      Fix that by setting the return value to RX success if
      the packet was handled successfully.
      
      [0] Reproduction steps:
       $ ip link add veth1 type veth peer name peer1
       $ ip link add veth2 type veth peer name peer2
       $ ifconfig peer1 5.5.5.6/24 up
       $ ip netns add ns0
       $ ip link set dev peer2 netns ns0
       $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
       $ ifconfig veth2 0 up
       $ ifconfig veth1 0 up
      
       #ingress forwarding veth1 <-> veth2
       $ tc qdisc add dev veth2 ingress
       $ tc qdisc add dev veth1 ingress
       $ tc filter add dev veth2 ingress prio 1 proto all flower \
         action mirred egress redirect dev veth1
       $ tc filter add dev veth1 ingress prio 1 proto all flower \
         action mirred egress redirect dev veth2
      
       #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
       $ tc qdisc add dev peer1 clsact
       $ tc filter add dev peer1 egress prio 20 proto ip flower \
         action mirred ingress redirect dev veth1
      
       #run iperf and see connection not running
       $ iperf3 -s&
       $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
      
       #delete egress rule, and run again, now should work
       $ tc filter del dev peer1 egress
       $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
      
      Fixes: f697c3e8 ("[NET]: Avoid unnecessary cloning for ingress filtering")
      Signed-off-by: NPaul Blakey <paulb@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      672e97ef
  23. 30 9月, 2022 1 次提交
  24. 24 8月, 2022 5 次提交