1. 15 May 2021, 1 commit
    •
      net: sched: fix tx action rescheduling issue during deactivation · 102b55ee
      Committed by Yunsheng Lin
      Currently qdisc_run() checks STATE_DEACTIVATED of a lockless
      qdisc before calling __qdisc_run(), which ultimately clears
      STATE_MISSED once all skbs are dequeued. If STATE_DEACTIVATED
      is set before STATE_MISSED is cleared, net_tx_action() may be
      rescheduled at the end of qdisc_run_end(), see below:
      
      CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
                .                   .                     .
                .            set STATE_MISSED             .
                .           __netif_schedule()            .
                .                   .           set STATE_DEACTIVATED
                .                   .                qdisc_reset()
                .                   .                     .
                .<---------------   .              synchronize_net()
      clear __QDISC_STATE_SCHED  |  .                     .
                .                |  .                     .
                .                |  .            some_qdisc_is_busy()
                .                |  .               return *false*
                .                |  .                     .
        test STATE_DEACTIVATED   |  .                     .
      __qdisc_run() *not* called |  .                     .
                .                |  .                     .
         test STATE_MISSED       |  .                     .
       __netif_schedule()--------|  .                     .
                .                   .                     .
                .                   .                     .
      
      __qdisc_run() is not called by net_tx_action() on CPU0 because
      CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate().
      Since STATE_MISSED is only cleared in __qdisc_run(),
      __netif_schedule() is called at the end of qdisc_run_end(),
      causing the tx action rescheduling problem.
      
      qdisc_run() called by net_tx_action() runs in softirq context,
      which should have the same semantics as qdisc_run() called by
      __dev_xmit_skb() under rcu_read_lock_bh(). Since there is a
      synchronize_net() between the STATE_DEACTIVATED flag being set and
      qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can
      safely bail out for a deactivated lockless qdisc in
      net_tx_action(), and qdisc_reset() will reset all skbs not yet
      dequeued.
      
      So add rcu_read_lock() explicitly to protect qdisc_run() and do
      the STATE_DEACTIVATED check in net_tx_action() before calling
      qdisc_run_begin(), as sketched below. Another option is to do the
      check in qdisc_run_end(), but that would add unnecessary overhead
      for the non-tx_action case: __dev_queue_xmit() will not see a
      qdisc with STATE_DEACTIVATED after synchronize_net(); such a
      qdisc can only be seen by net_tx_action() because of
      __netif_schedule().
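
      A minimal sketch of the reworked qdisc loop in net_tx_action(),
      following the shape of net/core/dev.c (illustrative, not the
      verbatim upstream diff):

          rcu_read_lock();

          while (head) {
                  struct Qdisc *q = head;
                  spinlock_t *root_lock = NULL;

                  head = head->next_sched;

                  if (!(q->flags & TCQ_F_NOLOCK)) {
                          root_lock = qdisc_lock(q);
                          spin_lock(root_lock);
                  } else if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
                                               &q->state))) {
                          /* Deactivated lockless qdisc: bail out early.
                           * synchronize_net() in dev_deactivate()
                           * guarantees the qdisc is still valid here,
                           * and qdisc_reset() will purge the skbs left
                           * in it. */
                          clear_bit(__QDISC_STATE_SCHED, &q->state);
                          continue;
                  }

                  clear_bit(__QDISC_STATE_SCHED, &q->state);
                  qdisc_run(q);
                  if (root_lock)
                          spin_unlock(root_lock);
          }

          rcu_read_unlock();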
      
      The STATE_DEACTIVATED check in qdisc_run() was there to avoid a
      race between net_tx_action() and qdisc_reset(), see:
      commit d518d2ed ("net/sched: fix race between deactivation
      and dequeue for NOLOCK qdisc"). Since the bailout added above for
      a deactivated lockless qdisc in net_tx_action() protects against
      that race without calling qdisc_run() at all, remove the
      STATE_DEACTIVATED check from qdisc_run().
      
      After qdisc_reset() there is no skb left in the qdisc to be
      dequeued, so clear STATE_MISSED in dev_reset_queue() too.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      V8: Clearing STATE_MISSED before calling __netif_schedule()
          avoids the endless rescheduling problem, but there may still
          be an unnecessary rescheduling, so adjust the commit log.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      102b55ee
  2. 23 Apr 2021, 1 commit
  3. 20 Apr 2021, 1 commit
  4. 14 Apr 2021, 1 commit
    •
      gro: ensure frag0 meets IP header alignment · 38ec4944
      Committed by Eric Dumazet
      After commit 0f6925b3 ("virtio_net: Do not pull payload in skb->head"),
      Guenter Roeck reported one failure in his tests on the sh architecture.
      
      After much debugging, we were able to spot silent unaligned accesses
      in inet_gro_receive().
      
      The issue at hand is that upper networking stacks assume their header
      is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
      bytes before the Ethernet header to make that happen.
      
      This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
      if the fragment is not properly aligned.
      
      Some arches like x86, arm64 and powerpc do not care and define
      NET_IP_ALIGN as 0, so this extra check is a NOP for them.
      
      Note that if frag0 is not used, GRO will call pskb_may_pull()
      as many times as needed to pull network and transport headers.
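
      A sketch of the hardened skb_gro_reset_offset() (helper names are
      from net/core/dev.c; treat this as an approximation of the patch,
      not the verbatim diff):

          static void skb_gro_reset_offset(struct sk_buff *skb)
          {
                  const struct skb_shared_info *pinfo = skb_shinfo(skb);
                  const skb_frag_t *frag0 = &pinfo->frags[0];

                  NAPI_GRO_CB(skb)->data_offset = 0;
                  NAPI_GRO_CB(skb)->frag0 = NULL;
                  NAPI_GRO_CB(skb)->frag0_len = 0;

                  if (!skb_headlen(skb) && pinfo->nr_frags &&
                      !PageHighMem(skb_frag_page(frag0)) &&
                      /* new: no frag0 fast-path unless the fragment is
                       * word-aligned (a NOP where NET_IP_ALIGN == 0) */
                      (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
                          NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                          NAPI_GRO_CB(skb)->frag0_len =
                                  min_t(unsigned int, skb_frag_size(frag0),
                                        skb->end - skb->tail);
                  }
          }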
      
      Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
      Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38ec4944
  5. 10 Apr 2021, 1 commit
  6. 08 Apr 2021, 1 commit
  7. 06 Apr 2021, 1 commit
  8. 26 Mar 2021, 1 commit
  9. 25 Mar 2021, 1 commit
    •
      net: resolve forwarding path from virtual netdevice and HW destination address · ddb94eaf
      Committed by Pablo Neira Ayuso
      This patch adds dev_fill_forward_path(), which resolves the path
      to reach the real netdevice from the IP forwarding side. This
      function takes the netdevice and the destination hardware address
      as input and walks down the devices, calling
      .ndo_fill_forward_path() on each device until the real device is
      found.
      
      For instance, assuming the following topology:
      
                     IP forwarding
                    /             \
                 br0              eth0
                 / \
             eth1  eth2
              .
              .
              .
             ethX
       ab:cd:ef:ab:cd:ef
      
      where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
      ethX is the interface in another box which is connected to the eth1
      bridge port.
      
      For packets going through IP forwarding to br0 whose destination MAC
      address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
      following path:
      
      	br0 -> eth1
      
      .ndo_fill_forward_path for br0 looks up the FDB by the destination
      MAC address to get the bridge port eth1.
      
      This information makes it possible to create a fast path that
      bypasses the classic bridge and IP forwarding paths, so packets go
      directly from the bridge port eth1 to eth0 (WAN interface) and
      vice versa (a code sketch of the lookup loop follows the diagram):
      
                   fast path
            .------------------------.
           /                          \
          |           IP forwarding   |
          |          /             \  \/
          |       br0               eth0
          .       / \
           -> eth1  eth2
              .
              .
              .
             ethX
       ab:cd:ef:ab:cd:ef
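
      A simplified sketch of the walking loop behind
      dev_fill_forward_path() (struct and callback names follow the
      patch; error handling trimmed):

          int dev_fill_forward_path(const struct net_device *dev, const u8 *daddr,
                                    struct net_device_path_stack *stack)
          {
                  const struct net_device *last_dev;
                  struct net_device_path_ctx ctx = {
                          .dev   = dev,
                          .daddr = daddr,
                  };
                  struct net_device_path *path;
                  int ret = 0;

                  stack->num_paths = 0;
                  while (ctx.dev && ctx.dev->netdev_ops->ndo_fill_forward_path) {
                          last_dev = ctx.dev;
                          path = dev_fwd_path(stack);  /* next free stack slot */
                          if (!path)
                                  return -1;

                          memset(path, 0, sizeof(struct net_device_path));
                          /* each hop fills its entry and advances ctx.dev */
                          ret = ctx.dev->netdev_ops->ndo_fill_forward_path(&ctx, path);
                          if (ret < 0)
                                  return -1;

                          /* a hop that does not advance would loop forever */
                          if (WARN_ON_ONCE(last_dev == ctx.dev))
                                  return -1;
                  }

                  /* terminate the path with the real device */
                  path = dev_fwd_path(stack);
                  if (!path)
                          return -1;
                  path->type = DEV_PATH_ETHERNET;
                  path->dev = ctx.dev;

                  return ret;
          }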
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddb94eaf
  10. 24 Mar 2021, 1 commit
    •
      net: make unregister netdev warning timeout configurable · 5aa3afe1
      Committed by Dmitry Vyukov
      netdev_wait_allrefs() issues a warning if refcount does not drop to 0
      after 10 seconds. While a 10-second wait generally should not
      happen under a normal workload in a normal environment, it seems
      to fire falsely very often during fuzzing and/or in qemu emulation
      (~10x slower); at the least, it is not possible to tell whether it
      is really a false positive. Automated testing generally bumps all
      timeouts to very high values to avoid flaky failures.
      Add net.core.netdev_unregister_timeout_secs sysctl to make
      the timeout configurable for automated testing systems.
      Lowering the timeout may also be useful for e.g. manual bisection.
      The default value matches the current behavior.
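
      A sketch of the check at the warning site (sysctl and variable
      names follow the patch; surrounding code trimmed):

          /* netdev_wait_allrefs(), sketch */
          if (time_after(jiffies, warning_time +
                         READ_ONCE(netdev_unregister_timeout_secs) * HZ)) {
                  pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
                           dev->name, refcnt);
                  warning_time = jiffies;
          }

      A test harness can then write a large value to
      /proc/sys/net/core/netdev_unregister_timeout_secs before starting.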
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5aa3afe1
  11. 23 Mar 2021, 2 commits
  12. 20 Mar 2021, 1 commit
    •
      net: add CONFIG_PCPU_DEV_REFCNT · 919067cc
      Committed by Eric Dumazet
      I was working on a syzbot report claiming one device could not be
      dismantled because its refcount was -1:
      
      unregister_netdevice: waiting for sit0 to become free. Usage count = -1
      
      It would be nice if syzbot could trigger a warning at the time
      this reference count became negative.
      
      This patch adds a CONFIG_PCPU_DEV_REFCNT option, which defaults to
      per-cpu variables (as before this patch) on SMP builds.
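
      A sketch of the build-time selection (field and helper names
      follow include/linux/netdevice.h; struct trimmed to the relevant
      members):

          struct net_device {
          #ifdef CONFIG_PCPU_DEV_REFCNT
                  int __percpu  *pcpu_refcnt;  /* fast, cannot catch underflow */
          #else
                  refcount_t    dev_refcnt;    /* refcount_t warns on underflow */
          #endif
          };

          static inline int netdev_refcnt_read(const struct net_device *dev)
          {
          #ifdef CONFIG_PCPU_DEV_REFCNT
                  int i, refcnt = 0;

                  for_each_possible_cpu(i)
                          refcnt += *per_cpu_ptr(dev->pcpu_refcnt, i);
                  return refcnt;
          #else
                  return refcount_read(&dev->dev_refcnt);
          #endif
          }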
      
      v2: free_dev label in alloc_netdev_mqs() is moved to avoid
          a compiler warning (-Wunused-label), as reported
          by kernel test robot <lkp@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      919067cc
  13. 19 Mar 2021, 9 commits
  14. 18 Mar 2021, 1 commit
    •
      net: fix race between napi kthread mode and busy poll · cb038357
      Committed by Wei Wang
      Currently, napi_thread_wait() checks the NAPI_STATE_SCHED bit to
      determine if the kthread owns this napi and could call napi->poll() on
      it. However, if socket busy poll is enabled, it is possible that the
      busy poll thread grabs this SCHED bit (after the previous napi->poll()
      invokes napi_complete_done() and clears SCHED bit) and tries to poll
      on the same napi. napi_disable() could grab the SCHED bit as well.
      This patch tries to fix this race by adding a new bit
      NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
      ____napi_schedule() if the threaded mode is enabled, and gets cleared
      in napi_complete_done(), and we only poll the napi in kthread if this
      bit is set. This helps distinguish the ownership of the napi between
      kthread and other scenarios and fixes the race issue.
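
      A sketch of the two sides of the handoff (following the shape of
      net/core/dev.c, simplified):

          static inline void ____napi_schedule(struct softnet_data *sd,
                                               struct napi_struct *napi)
          {
                  struct task_struct *thread;

                  if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
                          thread = READ_ONCE(napi->thread);
                          if (thread) {
                                  /* mark the kthread as owner; cleared
                                   * again in napi_complete_done() */
                                  set_bit(NAPI_STATE_SCHED_THREADED,
                                          &napi->state);
                                  wake_up_process(thread);
                                  return;
                          }
                  }

                  list_add_tail(&napi->poll_list, &sd->poll_list);
                  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
          }

          /* in napi_thread_wait(): testing SCHED alone is not enough,
           * since it may be held by a busy poll thread or napi_disable() */
          if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
                  __set_current_state(TASK_RUNNING);
                  return 0;
          }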
      
      Fixes: 29863d41 ("net: implement threaded-able napi poll loop support")
      Reported-by: Martin Zaharinov <micron10@gmail.com>
      Suggested-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Cc: Alexander Duyck <alexanderduyck@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb038357
  15. 16 Mar 2021, 2 commits
    •
      can: dev: Move device back to init netns on owning netns delete · 3a5ca857
      Committed by Martin Willi
      When a non-initial netns is destroyed, the usual policy is to
      delete all virtual network interfaces it contains, but to move
      physical interfaces back to the initial netns. This keeps the
      physical interface visible on the system.
      
      CAN devices are somewhat special, as they define rtnl_link_ops even
      if they are physical devices. If a CAN interface is moved into a
      non-initial netns, destroying that netns lets the interface vanish
      instead of moving it back to the initial netns. default_device_exit()
      skips CAN interfaces due to having rtnl_link_ops set. Reproducer:
      
        ip netns add foo
        ip link set can0 netns foo
        ip netns delete foo
      
      WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
      CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
      Workqueue: netns cleanup_net
      [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
      [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
      [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
      [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
      [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
      [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
      [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
      [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
      [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
      [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)
      
      To properly restore physical CAN devices to the initial netns on owning
      netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
      For CAN devices setting this flag, default_device_exit() considers them
      non-virtual, applying the usual namespace move.
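
      A sketch of the opt-in flag and its use (the netns_refund flag
      name follows the patch; loop body trimmed):

          /* include/net/rtnetlink.h, sketch */
          struct rtnl_link_ops {
                  /* ... */
                  bool netns_refund;  /* move dev back to init_net on netns exit */
          };

          /* net/core/dev.c: default_device_exit(), sketch */
          for_each_netdev_safe(net, dev, aux) {
                  /* previously any dev with rtnl_link_ops was skipped
                   * as virtual; devices opting in via netns_refund are
                   * now pushed back to the initial netns instead */
                  if (dev->rtnl_link_ops && !dev->rtnl_link_ops->netns_refund)
                          continue;

                  snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
                  err = dev_change_net_namespace(dev, &init_net, fb_name);
          }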
      
      The issue was introduced in the commit mentioned below, as at that time
      CAN devices did not have a dellink() operation.
      
      Fixes: e008b5fc ("net: Simplfy default_device_exit and improve batching.")
      Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      3a5ca857
    •
      net: export dev_set_threaded symbol · 8f64860f
      Committed by Lorenzo Bianconi
      For wireless devices (e.g. the mt76 driver) multiple net_devices
      belong to the same wireless phy, and the napi object is registered
      on a dummy netdevice related to that phy.
      Export dev_set_threaded so it can be reused by device drivers
      enabling threaded NAPI.
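
      A hypothetical driver-side call once the symbol is exported (the
      napi_dev field name is illustrative, mirroring the mt76
      dummy-netdevice pattern):

          /* driver probe path, illustrative */
          err = dev_set_threaded(&dev->napi_dev, true);
          if (err)
                  goto err_free_dev;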
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f64860f
  16. 15 Mar 2021, 3 commits
  17. 11 Mar 2021, 1 commit
  18. 14 Feb 2021, 2 commits
  19. 13 Feb 2021, 1 commit
  20. 12 Feb 2021, 1 commit
    •
      net: fix dev_ifsioc_locked() race condition · 3b23a32a
      Committed by Cong Wang
      dev_ifsioc_locked() is called with only the RCU read lock held, so
      when a parallel writer is changing the mac address, it could read
      a partially updated mac address, as shown below:
      
      Thread 1			Thread 2
      // eth_commit_mac_addr_change()
      memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
      				// dev_ifsioc_locked()
      				memcpy(ifr->ifr_hwaddr.sa_data,
      					dev->dev_addr,...);
      
      Close this race condition by guarding them with an RW semaphore,
      as netdev_get_name() does. We cannot use a seqlock here because it
      does not allow blocking. The writers already take RTNL anyway, so
      this does not affect the slow path. To avoid bothering existing
      dev_set_mac_address() callers in drivers, introduce a new wrapper
      just for user-facing callers on the ioctl and rtnetlink paths.
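
      A sketch of the guarded reader/writer pair (dev_addr_sem and the
      user-facing wrapper names follow the patch):

          static DECLARE_RWSEM(dev_addr_sem);

          /* writer used by the ioctl and rtnetlink paths */
          int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
                                       struct netlink_ext_ack *extack)
          {
                  int ret;

                  down_write(&dev_addr_sem);
                  ret = dev_set_mac_address(dev, sa, extack);
                  up_write(&dev_addr_sem);
                  return ret;
          }

          /* reader used by SIOCGIFHWADDR */
          int dev_get_mac_address(struct sockaddr *sa, struct net *net,
                                  char *dev_name)
          {
                  size_t size = sizeof(sa->sa_data);
                  struct net_device *dev;
                  int ret = 0;

                  down_read(&dev_addr_sem);
                  rcu_read_lock();

                  dev = dev_get_by_name_rcu(net, dev_name);
                  if (!dev) {
                          ret = -ENODEV;
                          goto unlock;
                  }
                  /* dev->dev_addr cannot change under the read lock */
                  if (!dev->addr_len)
                          memset(sa->sa_data, 0, size);
                  else
                          memcpy(sa->sa_data, dev->dev_addr,
                                 min_t(size_t, size, dev->addr_len));
                  sa->sa_family = dev->type;

          unlock:
                  rcu_read_unlock();
                  up_read(&dev_addr_sem);
                  return ret;
          }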
      
      Note, bonding also changes slave mac addresses but that requires
      a separate patch due to the complexity of bonding code.
      
      Fixes: 3710becf ("net: RCU locking for simple ioctl()")
      Reported-by: N"Gong, Sishuai" <sishuai@purdue.edu>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b23a32a
  21. 10 Feb 2021, 3 commits
  22. 06 Feb 2021, 1 commit
  23. 05 Feb 2021, 1 commit
    •
      net/core: move gro function declarations to separate header · 04f00ab2
      Committed by Leon Romanovsky
      Fix the following compilation warnings:
       1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
      
      net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
        182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
            |                                         ^~~~~~~~~~~~~~~~
      net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
        320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
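
      The fix moves prototypes for these entry points into a shared
      header so each definition has a previous declaration; a sketch of
      the moved declarations (header path illustrative):

          /* include/net/gro.h, sketch; INDIRECT_CALLABLE_DECLARE comes
           * from <linux/indirect_call_wrapper.h> */
          INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
                                                                     struct sk_buff *));
          INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int));
          INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
                                                                     struct sk_buff *));
          INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int));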
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04f00ab2
  24. 30 Jan 2021, 1 commit
    •
      net: support ip generic csum processing in skb_csum_hwoffload_help · 62fafcd6
      Committed by Xin Long
      The NETIF_F_IP|IPV6_CSUM feature flags indicate TCP and UDP csum
      offload, while the NETIF_F_HW_CSUM feature flag indicates generic
      IP csum offload in HW, which covers not only TCP/UDP csums but
      also other protocols' csums, like GRE's.
      
      However, skb_csum_hwoffload_help() only checks features against
      NETIF_F_CSUM_MASK (NETIF_F_HW|IP|IPV6_CSUM). So for a non-TCP/UDP
      packet on a device that supports only NETIF_F_IP|IPV6_CSUM and not
      NETIF_F_HW_CSUM, it would still return 0 and leave the csum to the
      HW.
      
      This patch supports generic IP csum processing by checking
      NETIF_F_HW_CSUM for all protocols, and checking (NETIF_F_IP_CSUM |
      NETIF_F_IPV6_CSUM) only for TCP and UDP.
      
      Note that we are using skb->csum_offset to check whether it is a
      TCP/UDP protocol, which might be fragile. However, as Alex said,
      for now only a few L4 protocols request Tx csum offload, so this
      fix should suffice until a new protocol arrives with the same csum
      offset.
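
      A sketch of the resulting check (shape follows
      skb_csum_hwoffload_help() in net/core/dev.c):

          int skb_csum_hwoffload_help(struct sk_buff *skb,
                                      const netdev_features_t features)
          {
                  if (unlikely(skb_csum_is_sctp(skb)))
                          return !!(features & NETIF_F_SCTP_CRC) ? 0 :
                                  skb_crc32c_csum_help(skb);

                  /* NETIF_F_HW_CSUM: HW can csum any protocol */
                  if (features & NETIF_F_HW_CSUM)
                          return 0;

                  /* NETIF_F_IP(V6)_CSUM: only TCP/UDP may be left to HW;
                   * csum_offset doubles as the (fragile) TCP/UDP test */
                  if (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) {
                          switch (skb->csum_offset) {
                          case offsetof(struct tcphdr, check):
                          case offsetof(struct udphdr, check):
                                  return 0;
                          }
                  }

                  return skb_checksum_help(skb);
          }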
      
      v1->v2:
        - do not extend skb->csum_not_inet, but use skb->csum_offset to
          tell if it's a UDP/TCP csum packet.
      v2->v3:
        - add a note in the changelog, as Willem suggested.
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      62fafcd6
  25. 23 Jan 2021, 1 commit