1. 25 Mar 2021 (5 commits)
  2. 24 Mar 2021 (1 commit)
    • net: make unregister netdev warning timeout configurable · 5aa3afe1
      Dmitry Vyukov committed
      netdev_wait_allrefs() issues a warning if the refcount does not drop to 0
      after 10 seconds. While a 10-second wait generally should not happen
      under a normal workload in a normal environment, it seems to fire falsely
      very often during fuzzing and/or in qemu emulation (~10x slower).
      At the very least, it is not possible to tell whether it is really a false
      positive or not. Automated testing generally bumps all timeouts
      to very high values to avoid flaky failures.
      Add a net.core.netdev_unregister_timeout_secs sysctl to make
      the timeout configurable for automated testing systems.
      Lowering the timeout may also be useful for e.g. manual bisection.
      The default value matches the current behavior.
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5aa3afe1
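      A small userspace sketch of how an automated test harness might raise the
      new timeout (the /proc/sys path is derived from the sysctl name
      net.core.netdev_unregister_timeout_secs; the 3600-second value is only an
      example):

      #include <stdio.h>

      /* Sketch: raise the unregister warning timeout for a fuzzing/emulation run.
       * Assumes a kernel that carries this patch and root privileges.
       */
      int main(void)
      {
              const char *path = "/proc/sys/net/core/netdev_unregister_timeout_secs";
              FILE *f = fopen(path, "w");

              if (!f) {
                      perror("fopen");
                      return 1;
              }
              fprintf(f, "%d\n", 3600);   /* effectively silence the warning */
              fclose(f);
              return 0;
      }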
  3. 23 Mar 2021 (3 commits)
    • net: set initial device refcount to 1 · add2d736
      Eric Dumazet committed
      When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
      initial net device refcount was 0.
      
      When CONFIG_PCPU_DEV_REFCNT is not set, this means that
      the first dev_hold() triggers an illegal refcount
      operation (addition on 0):
      
      refcount_t: addition on 0; use-after-free.
      WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4
      
      The fix is to change the initial (and final) refcount to 1.
      
      Also add a missing kerneldoc piece, as reported by
      Stephen Rothwell.
      
      Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <groeck@google.com>
      Tested-by: Guenter Roeck <groeck@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      add2d736
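      For context, refcount_t treats an increment from 0 as a use-after-free and
      saturates with the warning quoted above, which is why the object has to
      start life at 1. A minimal kernel-style sketch of that rule (illustrative
      only, not the net/core/dev.c code):

      #include <linux/printk.h>
      #include <linux/refcount.h>

      static refcount_t example_refs;

      static void example_init(void)
      {
              refcount_set(&example_refs, 1);  /* reference held by the object itself */
      }

      static void example_hold(void)
      {
              refcount_inc(&example_refs);     /* would warn and saturate if it were still 0 */
      }

      static void example_put(void)
      {
              if (refcount_dec_and_test(&example_refs))
                      pr_info("last reference dropped, safe to free\n");
      }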
    • net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h · 744b8376
      Vladimir Oltean committed
      ptype_all and ptype_base are declared in net/core/dev.c as non-static,
      because they are used by net-procfs.c too. However, a "make W=1" build
      complains that there was no previous declaration of ptype_all and
      ptype_base in a header file, so this way of declaring things constitutes
      a violation of coding style.
      
      Let's move the extern declarations of ptype_all and ptype_base to the
      linux/netdevice.h file, which is included by net-procfs.c too.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      744b8376
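      A minimal sketch of the declaration/definition split being applied (the
      __read_mostly placement and PTYPE_HASH_SIZE usage follow what net/core/dev.c
      already has; treat the details as illustrative):

      /* include/linux/netdevice.h: single authoritative extern declarations,
       * visible to both net/core/dev.c and net/core/net-procfs.c */
      extern struct list_head ptype_all;
      extern struct list_head ptype_base[PTYPE_HASH_SIZE];

      /* net/core/dev.c: the one and only definition */
      struct list_head ptype_all __read_mostly;
      struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;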
    • netdev: add netdev_queue_set_dql_min_limit() · f57bac3c
      Vincent Mailhol committed
      Add a function to set the dynamic queue limit minimum value.
      
      Some specific drivers might have legitimate reasons to configure
      dql.min_limit to a given value. Typically, this is the case when the
      PDU of the protocol is smaller than the packet size used to carry
      those frames to the device.
      
      Concrete example: a CAN (Controller Area Network) device with a USB 2.0
      interface.  The PDUs of the classical CAN protocol are roughly 16 bytes,
      but the USB packet size (which is used to carry the CAN frames to the
      device) might be up to 512 bytes.  When a small traffic burst occurs, the
      BQL algorithm is not able to adjust immediately, and this would result in
      sending many small USB packets (i.e. a packet of 16 bytes for each CAN
      frame). Filling up the USB packet with CAN frames is relatively fast
      (small latency cost), but the gain of not having to send several small
      USB packets is huge (big throughput increase). In this case, forcing
      dql.min_limit to a value that allows the USB packet to be filled is
      always a win.
      
      This function is to be used by network drivers that are able to prove,
      through a rationale and through empirical tests in several environments
      (with other applications, heavy context switching, virtualization...),
      that they consistently reach better performance with a specific
      predefined dql.min_limit value, with no noticeable latency impact.
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f57bac3c
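      A hedged sketch of how a driver like the CAN-over-USB case above might use
      the new helper from its ndo_open callback (the queue index and the 512-byte
      value are illustrative; the helper is assumed to take a struct netdev_queue
      pointer and a byte count, as described):

      #include <linux/netdevice.h>

      /* Sketch: pin the BQL lower bound to one USB bulk packet's worth of CAN
       * frames so that small bursts get aggregated instead of being sent as
       * many tiny USB packets.
       */
      static int example_ndo_open(struct net_device *netdev)
      {
              struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);

              netdev_queue_set_dql_min_limit(txq, 512);
              return 0;
      }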
  4. 20 Mar 2021 (1 commit)
    • net: add CONFIG_PCPU_DEV_REFCNT · 919067cc
      Eric Dumazet committed
      I was working on a syzbot report claiming that one device could not be
      dismantled because its refcount was -1:
      
      unregister_netdevice: waiting for sit0 to become free. Usage count = -1
      
      It would be nice if syzbot could trigger a warning at the time
      this reference count became negative.
      
      This patch adds a CONFIG_PCPU_DEV_REFCNT option which defaults
      to per-CPU variables (as before this patch) on SMP builds.
      
      v2: free_dev label in alloc_netdev_mqs() is moved to avoid
          a compiler warning (-Wunused-label), as reported
          by kernel test robot <lkp@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      919067cc
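      A hedged sketch of the split this option introduces (the function and field
      names below are illustrative stand-ins, not necessarily the real dev_hold()
      and net_device members):

      #include <linux/netdevice.h>
      #include <linux/refcount.h>

      /* Illustrative only: a per-CPU counter on SMP for cheap, bounce-free
       * increments, a plain refcount_t otherwise so misuse (e.g. a count going
       * negative) can be caught immediately.
       */
      #ifdef CONFIG_PCPU_DEV_REFCNT
      static inline void example_dev_hold(struct net_device *dev)
      {
              this_cpu_inc(*dev->pcpu_refcnt);
      }
      #else
      static inline void example_dev_hold(struct net_device *dev)
      {
              refcount_inc(&dev->dev_refcnt);
      }
      #endif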
  5. 19 Mar 2021 (3 commits)
  6. 17 Mar 2021 (1 commit)
  7. 04 Mar 2021 (1 commit)
  8. 25 Feb 2021 (3 commits)
  9. 13 Feb 2021 (1 commit)
  10. 12 Feb 2021 (1 commit)
    • net: fix dev_ifsioc_locked() race condition · 3b23a32a
      Cong Wang committed
      dev_ifsioc_locked() is called with only the RCU read lock held, so when
      there is a parallel writer changing the MAC address, it could
      get a partially updated MAC address, as shown below:
      
      Thread 1			Thread 2
      // eth_commit_mac_addr_change()
      memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
      				// dev_ifsioc_locked()
      				memcpy(ifr->ifr_hwaddr.sa_data,
      					dev->dev_addr,...);
      
      Close this race condition by guarding them with an RW semaphore,
      as netdev_get_name() does. We cannot use a seqlock here as it does not
      allow blocking. The writers already take the RTNL anyway, so this does
      not affect the slow path. To avoid bothering existing
      dev_set_mac_address() callers in drivers, introduce a new wrapper
      just for user-facing callers on the ioctl and rtnetlink paths.
      
      Note, bonding also changes slave mac addresses but that requires
      a separate patch due to the complexity of bonding code.
      
      Fixes: 3710becf ("net: RCU locking for simple ioctl()")
      Reported-by: N"Gong, Sishuai" <sishuai@purdue.edu>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: NCong Wang <cong.wang@bytedance.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b23a32a
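      A hedged sketch of the locking pattern the fix describes (the semaphore and
      helper names here are hypothetical, not the identifiers used in
      net/core/dev.c):

      #include <linux/netdevice.h>
      #include <linux/rwsem.h>
      #include <linux/string.h>

      /* Illustrative pattern: readers of dev->dev_addr and the writer updating
       * it serialize on an rwsem, since a seqlock cannot be used on paths that
       * may block.  Writers additionally hold RTNL, as before.
       */
      static DECLARE_RWSEM(example_addr_sem);

      static void example_read_hwaddr(struct net_device *dev, struct sockaddr *sa)
      {
              down_read(&example_addr_sem);
              memcpy(sa->sa_data, dev->dev_addr, dev->addr_len);
              up_read(&example_addr_sem);
      }

      static void example_write_hwaddr(struct net_device *dev, const struct sockaddr *sa)
      {
              down_write(&example_addr_sem);
              memcpy(dev->dev_addr, sa->sa_data, dev->addr_len);
              up_write(&example_addr_sem);
      }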
  11. 10 Feb 2021 (2 commits)
  12. 09 Feb 2021 (1 commit)
  13. 29 Jan 2021 (1 commit)
    • net: adjust net_device layout for cacheline usage · 28af22c6
      Jesper Dangaard Brouer committed
      The current layout of net_device is not optimal for cacheline usage.
      
      The member adj_list.lower linked list is split between cacheline 2 and 3.
      The ifindex is placed together with stats (struct net_device_stats),
      although most modern drivers don't update this stats member.
      
      The members netdev_ops, mtu and hard_header_len are placed on three
      different cachelines. These members are accessed for XDP redirect into
      a devmap, which was noticeable with the perf tool. When not using the map
      redirect variant (as TC-BPF does), ifindex is also used, and it is
      placed on a separate fourth cacheline. These members are also accessed
      during forwarding with the regular network stack. The members priv_flags and
      flags are on the fast path of the network stack transmit path in
      __dev_queue_xmit (currently located together with the mtu cacheline).
      
      This patch creates a read-mostly cacheline, with the purpose of keeping the
      above-mentioned members on the same cacheline.
      
      Some netdev_features_t members also become part of this cacheline, which is
      on purpose, as netif_skb_features() is on the fast path via
      validate_xmit_skb().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/r/161168277983.410784.12401225493601624417.stgit@firesoul
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      28af22c6
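      A hedged sketch of the layout technique (the field selection mirrors the
      members named above, but the struct itself is illustrative; the real patch
      reorders existing net_device members rather than adding a new structure):

      #include <linux/cache.h>
      #include <linux/netdevice.h>

      /* Illustrative only: keep the fields the TX/XDP fast paths read together
       * on one read-mostly cacheline, and start write-heavy state on its own
       * cacheline so it does not share a line with them.
       */
      struct example_fastpath_layout {
              /* read-mostly fast-path members, fetched with a single cache miss */
              unsigned int flags;
              unsigned int priv_flags;
              const struct net_device_ops *netdev_ops;
              int ifindex;
              unsigned short hard_header_len;
              unsigned int mtu;
              netdev_features_t features;

              /* frequently written stats pushed onto a separate cacheline */
              struct net_device_stats stats ____cacheline_aligned_in_smp;
      };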
  14. 23 Jan 2021 (1 commit)
    • sch_htb: Hierarchical QoS hardware offload · d03b195b
      Maxim Mikityanskiy committed
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
          classid 1:10
      
      It's possible to move classification to the clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80
          action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and doesn't
      require changes to the kernel or to iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue: it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
      
      Some features of HTB may not be supported by particular hardware: for
      example, the maximum number of classes may be limited, or the granularity
      of the rate and ceil parameters may differ. The offload is therefore not
      enabled by default; a new parameter is used to enable it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d03b195b
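      A hedged driver-side sketch of the control path described above (the
      TC_SETUP_QDISC_HTB setup type and the tc_htb_qopt_offload commands follow
      the offload interface this series introduces; treat the exact field names
      as assumptions to be verified against include/net/pkt_cls.h):

      #include <linux/errno.h>
      #include <linux/netdevice.h>
      #include <net/pkt_cls.h>

      /* Sketch only: how a driver's ndo_setup_tc could accept the HTB offload
       * commands.  The actual hardware programming is left as comments.
       */
      static int example_setup_tc(struct net_device *dev, enum tc_setup_type type,
                                  void *type_data)
      {
              struct tc_htb_qopt_offload *htb = type_data;

              if (type != TC_SETUP_QDISC_HTB)
                      return -EOPNOTSUPP;

              switch (htb->command) {
              case TC_HTB_CREATE:
                      /* allocate the root of the hardware rate-limiting hierarchy */
                      return 0;
              case TC_HTB_DESTROY:
                      /* tear the hardware hierarchy down */
                      return 0;
              default:
                      /* leaf alloc/delete/modify commands mirror tc tree edits */
                      return -EOPNOTSUPP;
              }
      }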
  15. 20 Jan 2021 (1 commit)
    • bonding: add a vlan+srcmac tx hashing option · 7b8fc010
      Jarod Wilson committed
      This comes from an end-user request, where they're running multiple VMs on
      hosts with bonded interfaces connected to some interesting switch topologies,
      where 802.3ad isn't an option. They're currently running a proprietary
      solution that effectively achieves load-balancing of VMs and bandwidth
      utilization improvements with a similar form of transmission algorithm.
      
      Basically, each VM has its own vlan, so it always sends its traffic out
      the same interface, unless that interface fails. Traffic gets split
      between the interfaces, maintaining a consistent path, with failover still
      available if an interface goes down.
      
      Unlike bond_eth_hash(), this hash function is using the full source MAC
      address instead of just the last byte, as there are so few components to
      the hash, and in the no-vlan case, we would be returning just the last
      byte of the source MAC as the hash value. It's entirely possible to have
      two NICs in a bond with the same last byte of their MAC, but not the same
      MAC, so this adjustment should guarantee distinct hashes in all cases.
      
      This has been rudimentarily tested and shown to provide results similar to
      the proprietary solution it aims to replace. A patch for iproute2 has also
      been posted, to properly support the new mode there as well.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Thomas Davis <tadavis@lbl.gov>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7b8fc010
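      A hedged sketch of the hashing idea only (this is not the kernel's
      bond_xmit_hash() code; it just shows folding the VLAN ID together with all
      six source MAC bytes instead of using only the last byte):

      #include <stdint.h>

      /* Illustrative vlan+srcmac hash: two ports that share only the last byte
       * of their MAC addresses still produce distinct hash values.
       */
      static uint32_t example_vlan_srcmac_hash(const uint8_t src_mac[6], uint16_t vlan_id)
      {
              uint32_t hash = vlan_id;
              int i;

              for (i = 0; i < 6; i++)
                      hash = hash * 31 + src_mac[i];

              return hash;
      }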
  16. 19 Jan 2021 (1 commit)
  17. 10 Jan 2021 (1 commit)
  18. 08 Jan 2021 (1 commit)
  19. 17 Dec 2020 (1 commit)
  20. 02 Dec 2020 (1 commit)
  21. 01 Dec 2020 (1 commit)
    • net: Introduce preferred busy-polling · 7fd3253a
      Björn Töpel committed
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      opportunistic. That means that if the NAPI context is not
      scheduled, it will poll it. If, after busy-polling, the budget is
      exceeded, the busy-polling logic will schedule the NAPI context onto the
      regular softirq handling.
      
      One implication of the behavior above is that a busy/heavily loaded NAPI
      context will never enter/allow busy-polling. Some applications
      prefer that most NAPI processing be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
      introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
      feature"), and allow a user to defer enabling interrupts and
      instead schedule the NAPI context from a watchdog timer. When a user
      enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
      and the NAPI context is being processed by a softirq, the softirq NAPI
      processing will exit early to allow busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will expire, and
      regular softirq handling will resume.
      
      In summary: heavy-traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window, otherwise the watchdog will expire and processing will fall back
      to regular softirq handling.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
      7fd3253a
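      A hedged userspace sketch that enables both options on a socket
      (SO_PREFER_BUSY_POLL requires headers from a kernel carrying this series;
      the fallback value of 69 matches the asm-generic definition at the time and
      is an assumption):

      #include <stdio.h>
      #include <sys/socket.h>

      #ifndef SO_PREFER_BUSY_POLL
      #define SO_PREFER_BUSY_POLL 69   /* assumed asm-generic value for older headers */
      #endif

      int main(void)
      {
              int one = 1, busy_usecs = 50;
              int fd = socket(AF_INET, SOCK_DGRAM, 0);

              if (fd < 0) {
                      perror("socket");
                      return 1;
              }
              /* opportunistic busy polling (pre-existing knob) ... */
              setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usecs, sizeof(busy_usecs));
              /* ... and ask softirq NAPI processing to yield to it (new option) */
              setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one));
              return 0;
      }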
  22. 25 Nov 2020 (1 commit)
  23. 24 Nov 2020 (2 commits)
  24. 18 Nov 2020 (1 commit)
  25. 10 Nov 2020 (1 commit)
  26. 01 Nov 2020 (2 commits)
  27. 14 Oct 2020 (1 commit)