1. 28 July 2021 (2 commits)
    • dev_ioctl: pass SIOCDEVPRIVATE data separately · a554bf96
      Committed by Arnd Bergmann
      The compat handlers for SIOCDEVPRIVATE are incorrect for any driver that
      passes data as part of struct ifreq rather than as an ifr_data pointer, or
      that passes data back this way, since the compat_ifr_data_ioctl() helper
      overwrites the ifr_data pointer and does not copy anything back out.
      
      Since all drivers using devprivate commands are now converted to the
      new .ndo_siocdevprivate callback, fix this by adding the missing piece
      and passing the pointer separately the whole way.
      
      This further unifies the native and compat logic for socket ioctls,
      as the new code now passes the correct pointer as well as the correct
      data for both native and compat ioctls.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a554bf96
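      For illustration, a minimal sketch of the resulting dispatch, assuming a
      simplified form of the helpers the commit describes (not the exact
      upstream code):

          static int dev_siocdevprivate(struct net_device *dev, struct ifreq *ifr,
                                        void __user *data, unsigned int cmd)
          {
                  const struct net_device_ops *ops = dev->netdev_ops;

                  /* The user pointer now travels alongside the ifreq instead
                   * of being stored in ifr->ifr_data, so nothing is
                   * overwritten and the ifreq can be copied back out for
                   * native and compat callers alike. */
                  if (ops->ndo_siocdevprivate)
                          return ops->ndo_siocdevprivate(dev, ifr, data, cmd);

                  return -EOPNOTSUPP;
          }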
    • net: split out SIOCDEVPRIVATE handling from dev_ioctl · b9067f5d
      Committed by Arnd Bergmann
      SIOCDEVPRIVATE ioctl commands are mainly used in really old
      drivers, and they have a number of problems:
      
      - They hide behind the normal .ndo_do_ioctl function that
        is also used for other things in modern drivers, so it's
        hard to spot a driver that actually uses one of these
      
      - Since drivers use a number of different calling conventions,
        it is impossible to support compat mode for them in
        a generic way.
      
      - With all drivers using the same 16 command codes, there
        is no way to introspect the data being passed through
        things like strace.
      
      Add a new net_device_ops callback pointer, to address the
      first two of these. Separating them from .ndo_do_ioctl
      makes it easy to grep for drivers with a .ndo_siocdevprivate
      callback, and the unwieldy name hopefully makes it easier
      to spot in code review.
      
      By passing the ifreq structure and the ifr_data pointer
      separately, it is no longer necessary to overload these,
      and the driver can use either one for a given command.
      
      Cc: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9067f5d
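      For illustration, the shape of the new callback plus a hypothetical
      driver wiring it up (the mydrv_* names are made up; the callback
      signature follows the commit):

          /* New member of struct net_device_ops: */
          int (*ndo_siocdevprivate)(struct net_device *dev, struct ifreq *ifr,
                                    void __user *data, int cmd);

          /* Hypothetical legacy driver implementing it: */
          static int mydrv_siocdevprivate(struct net_device *dev, struct ifreq *ifr,
                                          void __user *data, int cmd)
          {
                  if (cmd < SIOCDEVPRIVATE || cmd > SIOCDEVPRIVATE + 15)
                          return -EOPNOTSUPP;
                  return mydrv_handle_private(dev, ifr, data, cmd); /* hypothetical */
          }

          static const struct net_device_ops mydrv_netdev_ops = {
                  .ndo_siocdevprivate = mydrv_siocdevprivate,
          };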
  2. 23 July 2021 (3 commits)
    • net: socket: rework compat_ifreq_ioctl() · 29c49648
      Committed by Arnd Bergmann
      compat_ifreq_ioctl() is one of the last users of copy_in_user() and
      compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
      arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
      couple of socket family specific interpretations.
      
      The current implementation works correctly when calling dev_ioctl(),
      inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
      and packet_ioctl(). The ioctl handlers for ax25, netrom, rose and x25 do
      not interpret the arguments and only block the corresponding commands,
      so they do not care.
      
      For af_inet6 and af_decnet however, the compat conversion is slightly
      incorrect, as it will copy more data than the native handler accesses;
      both of them use a structure that is shorter than ifreq.
      
      Replace the copy_in_user() conversion with a pair of accessor functions
      to read and write the ifreq data in place with the correct length where
      needed, while leaving the other ones to copy the (already compatible)
      structures directly.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29c49648
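      A condensed sketch of the read-side accessor, assuming a simplified form
      of the helper pair (a matching put_user_ifreq() writes the result back
      with the same length rules):

          int get_user_ifreq(struct ifreq *ifr, void __user **ifrdata,
                             void __user *arg)
          {
                  if (in_compat_syscall()) {
                          struct compat_ifreq *ifr32 = (struct compat_ifreq *)ifr;

                          /* Read the shorter compat layout in place. */
                          memset(ifr, 0, sizeof(*ifr));
                          if (copy_from_user(ifr32, arg, sizeof(*ifr32)))
                                  return -EFAULT;
                          if (ifrdata)
                                  *ifrdata = compat_ptr(ifr32->ifr_data);
                          return 0;
                  }

                  if (copy_from_user(ifr, arg, sizeof(*ifr)))
                          return -EFAULT;
                  if (ifrdata)
                          *ifrdata = ifr->ifr_data;
                  return 0;
          }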
    • net: socket: simplify dev_ifconf handling · 876f0bf9
      Committed by Arnd Bergmann
      The dev_ifconf() calling conventions make compat handling
      more complicated than necessary; simplify this by moving
      the in_compat_syscall() check into the function.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      876f0bf9
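      A sketch of the reworked entry point, assuming a condensed form in which
      dev_ifconf() itself picks the native or compat layout:

          int dev_ifconf(struct net *net, struct ifconf __user *uifc)
          {
                  void __user *pos;
                  size_t size;
                  int len;

                  if (in_compat_syscall()) {
                          struct compat_ifconf ifc32;

                          if (copy_from_user(&ifc32, uifc, sizeof(ifc32)))
                                  return -EFAULT;
                          pos = compat_ptr(ifc32.ifcbuf);
                          len = ifc32.ifc_len;
                          size = sizeof(struct compat_ifreq);
                  } else {
                          struct ifconf ifc;

                          if (copy_from_user(&ifc, uifc, sizeof(ifc)))
                                  return -EFAULT;
                          pos = ifc.ifc_buf;
                          len = ifc.ifc_len;
                          size = sizeof(struct ifreq);
                  }
                  /* ... walk net's devices, filling 'pos' in 'size'-byte
                   * slots, then write the resulting length back out ... */
                  return 0;
          }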
    • net: socket: remove register_gifconf · b0e99d03
      Committed by Arnd Bergmann
      Since dynamic registration of the gifconf() helper is only used for
      IPv4, and that code cannot live in a loadable module, this can be
      simplified noticeably by turning the registration into a direct function
      call, as a preparation for cleaning up the compat handling.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0e99d03
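      A hypothetical sketch of the resulting direct call; the wrapper name and
      exact parameters are assumptions, only the removal of the gifconf_list[]
      indirection follows the commit:

          static int dev_ifconf_one(struct net_device *dev, char __user *buf,
                                    int len, int size)
          {
                  /* IPv4 is built in, so call it directly instead of going
                   * through a registered gifconf_list[AF_INET] hook. */
                  if (!IS_ENABLED(CONFIG_INET))
                          return 0;
                  return inet_gifconf(dev, buf, len, size);
          }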
  3. 08 July 2021 (1 commit)
  4. 26 June 2021 (1 commit)
    • dev_forward_skb: do not scrub skb mark within the same name space · ff70202b
      Committed by Nicolas Dichtel
      The goal is to keep the mark during a bpf_redirect(), as is done for
      legacy encapsulation / decapsulation when there is no x-netns.
      This was initially done in commit 213dd74a ("skbuff: Do not scrub skb
      mark within the same name space").
      
      When the call to skb_scrub_packet() was added in dev_forward_skb() (commit
      8b27f277 ("skb: allow skb_scrub_packet() to be used by tunnels")), the
      second argument (xnet) was set to true to force a call to skb_orphan(). At
      that time, the mark was always cleaned up by skb_scrub_packet(), whatever
      the xnet value was.
      This call to skb_orphan() was removed later in commit
      9c4c3252 ("skbuff: preserve sock reference when scrubbing the skb."),
      but the 'true' stayed there without any real reason.
      
      Let's set xnet correctly in ____dev_forward_skb(); this function has
      access to both the previous interface and the new interface.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff70202b
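      The core of the fix, condensed (assumption: the forwardability checks
      around it are abbreviated):

          static __always_inline int ____dev_forward_skb(struct net_device *dev,
                                                         struct sk_buff *skb)
          {
                  if (unlikely(!is_skb_forwardable(dev, skb))) {
                          kfree_skb(skb);
                          return NET_RX_DROP;
                  }

                  /* xnet is true only when actually crossing netns, so the
                   * skb mark survives a same-netns bpf_redirect(). */
                  skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
                  skb->priority = 0;
                  return 0;
          }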
  5. 04 June 2021 (1 commit)
    • mlx5: count all link events · 490dceca
      Committed by Jakub Kicinski
      mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
      events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
      breaks link flap detection based on Linux carrier state transition
      count as netif_carrier_on() does nothing if carrier is already on.
      Make sure we count such events.
      
      netif_carrier_event() increments the counters and fires the linkwatch
      events. The latter is not necessary for the use case but seems like
      the right thing to do.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      490dceca
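      A sketch of netif_carrier_event(), condensed from the description above:

          void netif_carrier_event(struct net_device *dev)
          {
                  if (dev->reg_state == NETREG_UNINITIALIZED)
                          return;
                  /* Count the transition even though the carrier state
                   * itself does not change ... */
                  atomic_inc(&dev->carrier_up_count);
                  atomic_inc(&dev->carrier_down_count);
                  /* ... and fire linkwatch, as a real flap would. */
                  linkwatch_fire_event(dev);
          }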
  6. 08 April 2021 (1 commit)
  7. 06 April 2021 (1 commit)
  8. 25 March 2021 (7 commits)
  9. 24 March 2021 (1 commit)
    • net: make unregister netdev warning timeout configurable · 5aa3afe1
      Committed by Dmitry Vyukov
      netdev_wait_allrefs() issues a warning if the refcount does not drop to 0
      after 10 seconds. While a 10-second wait generally should not happen
      under a normal workload in a normal environment, it seems to fire falsely
      very often during fuzzing and/or in qemu emulation (~10x slower).
      At least it's not possible to tell whether it's really a false
      positive or not. Automated testing generally bumps all timeouts
      to very high values to avoid flaky failures.
      Add a net.core.netdev_unregister_timeout_secs sysctl to make
      the timeout configurable for automated testing systems.
      Lowering the timeout may also be useful for e.g. manual bisection.
      The default value matches the current behavior.
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5aa3afe1
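      A sketch of how the knob is consumed, assuming a condensed form of the
      warning path (the helper name is hypothetical):

          int netdev_unregister_timeout_secs __read_mostly = 10;

          static void wait_allrefs_warn(struct net_device *dev,
                                        unsigned long *warning_time)
          {
                  /* Same message as before, but the interval is tunable. */
                  if (time_after(jiffies, *warning_time +
                                 netdev_unregister_timeout_secs * HZ)) {
                          pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
                                   dev->name, netdev_refcnt_read(dev));
                          *warning_time = jiffies;
                  }
          }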
  10. 23 March 2021 (3 commits)
    • net: set initial device refcount to 1 · add2d736
      Committed by Eric Dumazet
      When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
      initial net device refcount was 0.
      
      When CONFIG_PCPU_DEV_REFCNT is not set, this means
      the first dev_hold() triggers an illegal refcount
      operation (addition on 0):
      
      refcount_t: addition on 0; use-after-free.
      WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4
      
      The fix is to change the initial (and final) refcount to 1.
      
      Also add a missing kerneldoc piece, as reported by
      Stephen Rothwell.
      
      Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <groeck@google.com>
      Tested-by: Guenter Roeck <groeck@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      add2d736
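      A condensed sketch of the fix for the plain refcount_t case (the two
      wrappers are hypothetical, framing only the relevant lines):

          static void mydrv_alloc_side(struct net_device *dev)
          {
                  /* Was implicitly 0 before, making the first dev_hold()
                   * an increment from zero. */
                  refcount_set(&dev->dev_refcnt, 1);
          }

          static bool mydrv_all_refs_gone(struct net_device *dev)
          {
                  /* Teardown now waits for the count to return to 1. */
                  return netdev_refcnt_read(dev) == 1;
          }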
    • net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h · 744b8376
      Committed by Vladimir Oltean
      ptype_all and ptype_base are declared in net/core/dev.c as non-static,
      because they are used by net-procfs.c too. However, a "make W=1" build
      complains that there is no previous declaration of ptype_all and
      ptype_base in a header file, so this way of declaring things constitutes
      a violation of coding style.
      
      Let's move the extern declarations of ptype_all and ptype_base to the
      linux/netdevice.h file, which is included by net-procfs.c too.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      744b8376
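      The moved declarations, as described (assuming the PTYPE_HASH_SIZE macro
      is visible in the same header):

          /* include/linux/netdevice.h */
          extern struct list_head ptype_all __read_mostly;
          extern struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;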
    • netdev: add netdev_queue_set_dql_min_limit() · f57bac3c
      Committed by Vincent Mailhol
      Add a function to set the dynamic queue limit minimum value.
      
      Some specific drivers might have legitimate reasons to configure
      dql.min_limit to a given value. Typically, this is the case when the
      PDU of the protocol is smaller than the packet size used to
      carry those frames to the device.
      
      Concrete example: a CAN (Controller Area Network) device with a USB 2.0
      interface.  The PDUs of the classical CAN protocol are roughly 16 bytes,
      but the USB packet size (which is used to carry the CAN frames to the
      device) might be up to 512 bytes.  When a small traffic burst occurs, the
      BQL algorithm is not able to adjust immediately, which results in
      having to send many small USB packets (i.e. a packet of 16 bytes for
      each CAN frame). Filling up the USB packet with CAN frames is relatively
      fast (small latency cost), but the gain of not having to send several
      small USB packets is huge (big throughput increase). In this case,
      forcing dql.min_limit to a value that allows stuffing the
      USB packet is always a win.
      
      This function is to be used by network drivers which are able to prove,
      through a rationale and through empirical tests in several environments
      (with other applications, heavy context switching, virtualization...),
      that they consistently achieve better performance with a specific
      predefined dql.min_limit value, with no noticeable latency impact.
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f57bac3c
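      The helper as described, a no-op when CONFIG_BQL is off, with a
      hypothetical CAN-USB driver call sized to one 512-byte USB packet:

          static inline void
          netdev_queue_set_dql_min_limit(struct netdev_queue *dev_queue,
                                         unsigned int min_limit)
          {
          #ifdef CONFIG_BQL
                  dev_queue->dql.min_limit = min_limit;
          #endif
          }

          /* e.g. in a driver's ndo_open() (values hypothetical): */
          netdev_queue_set_dql_min_limit(netdev_get_tx_queue(netdev, 0), 512);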
  11. 20 March 2021 (1 commit)
    • net: add CONFIG_PCPU_DEV_REFCNT · 919067cc
      Committed by Eric Dumazet
      I was working on a syzbot issue, claiming one device could not be
      dismantled because its refcount was -1
      
      unregister_netdevice: waiting for sit0 to become free. Usage count = -1
      
      It would be nice if syzbot could trigger a warning at the time
      this reference count became negative.
      
      This patch adds a CONFIG_PCPU_DEV_REFCNT option, which defaults
      to per-cpu variables (as before this patch) on SMP builds.
      
      v2: free_dev label in alloc_netdev_mqs() is moved to avoid
          a compiler warning (-Wunused-label), as reported
          by kernel test robot <lkp@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      919067cc
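      A condensed sketch of the two variants behind the new option:

          static inline void dev_hold(struct net_device *dev)
          {
          #ifdef CONFIG_PCPU_DEV_REFCNT
                  this_cpu_inc(*dev->pcpu_refcnt);  /* scalable; SMP default */
          #else
                  refcount_inc(&dev->dev_refcnt);   /* checked; would have
                                                     * warned on the -1 above */
          #endif
          }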
  12. 19 March 2021 (3 commits)
  13. 18 March 2021 (1 commit)
    • net: fix race between napi kthread mode and busy poll · cb038357
      Committed by Wei Wang
      Currently, napi_thread_wait() checks the NAPI_STATE_SCHED bit to
      determine if the kthread owns this napi and may call napi->poll() on
      it. However, if socket busy poll is enabled, it is possible that the
      busy poll thread grabs this SCHED bit (after the previous napi->poll()
      invokes napi_complete_done() and clears the SCHED bit) and tries to poll
      on the same napi. napi_disable() could grab the SCHED bit as well.
      This patch tries to fix this race by adding a new bit
      NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
      ____napi_schedule() if the threaded mode is enabled, and gets cleared
      in napi_complete_done(), and we only poll the napi in kthread if this
      bit is set. This helps distinguish the ownership of the napi between
      kthread and other scenarios and fixes the race issue.
      
      Fixes: 29863d41 ("net: implement threaded-able napi poll loop support")
      Reported-by: Martin Zaharinov <micron10@gmail.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Cc: Alexander Duyck <alexanderduyck@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb038357
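      A condensed sketch of the resulting ownership check in napi_thread_wait():

          static int napi_thread_wait(struct napi_struct *napi)
          {
                  bool woken = false;

                  set_current_state(TASK_INTERRUPTIBLE);
                  while (!kthread_should_stop() && !napi_disable_pending(napi)) {
                          /* SCHED alone is not enough: busy poll or
                           * napi_disable() may own that bit right now. */
                          if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) ||
                              woken) {
                                  __set_current_state(TASK_RUNNING);
                                  return 0;
                          }
                          schedule();
                          woken = true;   /* woken implies we own the napi */
                          set_current_state(TASK_INTERRUPTIBLE);
                  }
                  __set_current_state(TASK_RUNNING);
                  return -1;
          }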
  14. 17 March 2021 (1 commit)
  15. 04 March 2021 (1 commit)
  16. 25 February 2021 (3 commits)
  17. 13 February 2021 (1 commit)
  18. 12 February 2021 (1 commit)
    • net: fix dev_ifsioc_locked() race condition · 3b23a32a
      Committed by Cong Wang
      dev_ifsioc_locked() is called with only the RCU read lock held, so when
      there is a parallel writer changing the mac address, it could
      get a partially updated mac address, as shown below:
      
      Thread 1			Thread 2
      // eth_commit_mac_addr_change()
      memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
      				// dev_ifsioc_locked()
      				memcpy(ifr->ifr_hwaddr.sa_data,
      					dev->dev_addr,...);
      
      Close this race condition by guarding them with an RW semaphore,
      like netdev_get_name(). We cannot use a seqlock here as it does not
      allow blocking. The writers already take RTNL anyway, so this does
      not affect the slow path. To avoid bothering existing
      dev_set_mac_address() callers in drivers, introduce a new wrapper
      just for user-facing callers on ioctl and rtnetlink paths.
      
      Note, bonding also changes slave mac addresses but that requires
      a separate patch due to the complexity of bonding code.
      
      Fixes: 3710becf ("net: RCU locking for simple ioctl()")
      Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b23a32a
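      A condensed sketch of the writer side of the pairing; the read side
      (dev_ifsioc's SIOCGIFHWADDR path) takes down_read() around the MAC copy:

          static DECLARE_RWSEM(dev_addr_sem);

          int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
                                       struct netlink_ext_ack *extack)
          {
                  int ret;

                  down_write(&dev_addr_sem);
                  ret = dev_set_mac_address(dev, sa, extack);
                  up_write(&dev_addr_sem);
                  return ret;
          }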
  19. 10 February 2021 (2 commits)
  20. 09 February 2021 (1 commit)
  21. 29 January 2021 (1 commit)
    • net: adjust net_device layout for cacheline usage · 28af22c6
      Committed by Jesper Dangaard Brouer
      The current layout of net_device is not optimal for cacheline usage.
      
      The adj_list.lower linked-list member is split between cachelines 2 and 3.
      The ifindex is placed together with stats (struct net_device_stats),
      although most modern drivers don't update this stats member.
      
      The members netdev_ops, mtu and hard_header_len are placed on three
      different cachelines. These members are accessed for XDP redirect into
      devmap, which was noticeable with the perf tool. When not using the map
      redirect variant (as TC-BPF does), ifindex is also used, and it is
      placed on a separate, fourth cacheline. These members are also accessed
      during forwarding with the regular network stack. The members priv_flags
      and flags are on the fast path of the network stack transmit path in
      __dev_queue_xmit (currently located together with the mtu cacheline).
      
      This patch creates a read-mostly cacheline, with the purpose of keeping
      the above-mentioned members on the same cacheline.
      
      Some netdev_features_t members also become part of this cacheline, which
      is on purpose, as the function netif_skb_features() is on the fast path
      via validate_xmit_skb().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/r/161168277983.410784.12401225493601624417.stgit@firesoul
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      28af22c6
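      Illustrative ordering only (an assumption, not the full struct): the
      fast-path members end up grouped on one read-mostly cacheline:

          struct net_device {
                  /* ... */
                  /* --- read-mostly fast-path group --- */
                  unsigned int            flags;
                  unsigned int            priv_flags;
                  const struct net_device_ops *netdev_ops;
                  int                     ifindex;
                  unsigned short          hard_header_len;
                  unsigned int            mtu;
                  netdev_features_t       features;
                  /* ... */
          };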
  22. 23 January 2021 (1 commit)
    • sch_htb: Hierarchical QoS hardware offload · d03b195b
      Committed by Maxim Mikityanskiy
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
          classid 1:10
      
      It's possible to move classification to clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80
          action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and doesn't
      require changes to the kernel or iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue; it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
      
      Some features of HTB may not be supported by particular hardware, for
      example, the maximum number of classes may be limited, the granularity
      of the rate and ceil parameters may differ, etc., so the offload is not
      enabled by default; a new parameter is used to enable it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d03b195b
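      A hypothetical driver-side dispatch for the new offload type (the
      mydrv_* helpers are made up; the offload type and command values follow
      the commit's design):

          static int mydrv_setup_tc(struct net_device *dev, enum tc_setup_type type,
                                    void *type_data)
          {
                  struct tc_htb_qopt_offload *htb = type_data;

                  if (type != TC_SETUP_QDISC_HTB)
                          return -EOPNOTSUPP;

                  switch (htb->command) {
                  case TC_HTB_CREATE:
                          return mydrv_htb_root_add(dev, htb);    /* hypothetical */
                  case TC_HTB_LEAF_ALLOC_QUEUE:
                          return mydrv_htb_leaf_alloc(dev, htb);  /* hypothetical */
                  case TC_HTB_DESTROY:
                          return mydrv_htb_root_del(dev);         /* hypothetical */
                  default:
                          return -EOPNOTSUPP;
                  }
          }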
  23. 20 January 2021 (1 commit)
    • bonding: add a vlan+srcmac tx hashing option · 7b8fc010
      Committed by Jarod Wilson
      This comes from an end-user request, where they're running multiple VMs on
      hosts with bonded interfaces connected to some interesting switch
      topologies, where 802.3ad isn't an option. They're currently running a
      proprietary solution that effectively achieves load-balancing of VMs and
      bandwidth utilization improvements with a similar form of transmission
      algorithm.
      
      Basically, each VM has its own vlan, so it always sends its traffic out
      the same interface, unless that interface fails. Traffic gets split
      between the interfaces, maintaining a consistent path, with failover still
      available if an interface goes down.
      
      Unlike bond_eth_hash(), this hash function uses the full source MAC
      address instead of just the last byte, as there are so few components to
      the hash, and in the no-vlan case, we would be returning just the last
      byte of the source MAC as the hash value. It's entirely possible to have
      two NICs in a bond with the same last byte of their MAC, but not the same
      MAC, so this adjustment should guarantee distinct hashes in all cases.
      
      This has been rudimentarily tested to provide similar results to the
      proprietary solution it aims to replace. A patch for iproute2 has also
      been posted, to properly support the new mode there as well.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Thomas Davis <tadavis@lbl.gov>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7b8fc010
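      A sketch of the hash described above, reconstructed from the commit text
      (vendor and device halves of the full source MAC, with the vlan tag
      XORed in when present):

          static u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
          {
                  struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
                  u32 srcmac_vendor = 0, srcmac_dev = 0;
                  u16 vlan;
                  int i;

                  for (i = 0; i < 3; i++)
                          srcmac_vendor = (srcmac_vendor << 8) | mac_hdr->h_source[i];

                  for (i = 3; i < ETH_ALEN; i++)
                          srcmac_dev = (srcmac_dev << 8) | mac_hdr->h_source[i];

                  if (!skb_vlan_tag_present(skb))
                          return srcmac_vendor ^ srcmac_dev;

                  vlan = skb_vlan_tag_get(skb);

                  return vlan ^ srcmac_vendor ^ srcmac_dev;
          }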
  24. 19 January 2021 (1 commit)