1. 09 Oct 2021 (1 commit)
    • net: introduce a function to check if a netdev name is in use · 75ea27d0
      Authored by Antoine Tenart
      __dev_get_by_name is currently used either to retrieve a net device
      reference using its name or to check if a name is already used by a
      registered net device (per netns). In the latter case there is no
      need to return a reference to a net device.
      
      Introduce a new helper, netdev_name_in_use, to check if a name is
      currently used by a registered net device, without leaking a
      reference to the corresponding net device. This helper uses
      netdev_name_node_lookup instead of __dev_get_by_name as we don't
      need the extra logic of retrieving a reference to the net device.
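      
      The resulting helper is tiny. A sketch of what it plausibly looks
      like in net/core/dev.c, based on the description above (the exact
      upstream body may differ slightly):
      
      /* Check whether @name is already taken by a registered net device
       * in @net. Unlike __dev_get_by_name(), no device pointer escapes,
       * so the caller cannot accidentally hold on to a reference. */
      bool netdev_name_in_use(struct net *net, const char *name)
      {
              return netdev_name_node_lookup(net, name);
      }
      EXPORT_SYMBOL(netdev_name_in_use);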
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Oct 2021 (1 commit)
  3. 27 Sep 2021 (1 commit)
  4. 20 Sep 2021 (1 commit)
    • napi: fix race inside napi_enable · 3765996e
      Authored by Xuan Zhuo
      The sequence below can leave NAPI_STATE_SCHED set in napi.state
      while the napi is not on the poll_list, which causes napi_disable()
      to get stuck.
      
      The prefix "NAPI_STATE_" is removed in the figure below, and
      NAPI_STATE_HASHED is ignored in napi.state.
      
                            CPU0       |                   CPU1       | napi.state
      ===============================================================================
      napi_disable()                   |                              | SCHED | NPSVC
      napi_enable()                    |                              |
      {                                |                              |
          smp_mb__before_atomic();     |                              |
          clear_bit(SCHED, &n->state); |                              | NPSVC
                                       | napi_schedule_prep()         | SCHED | NPSVC
                                       | napi_poll()                  |
                                       |   napi_complete_done()       |
                                       |   {                          |
                                       |      if (n->state & (NPSVC | | (1)
                                       |               _BUSY_POLL)))  |
                                       |           return false;      |
                                       |     ................         |
                                       |   }                          | SCHED | NPSVC
                                       |                              |
          clear_bit(NPSVC, &n->state); |                              | SCHED
      }                                |                              |
                                       |                              |
      napi_schedule_prep()             |                              | SCHED | MISSED (2)
      
      (1) Returns early here, because NAPI_STATE_NPSVC is set.
      (2) NAPI_STATE_SCHED is already set, so the napi is not added to sd->poll_list.
      
      Since NAPI_STATE_SCHED is set but the napi is not on the
      sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and
      remains set forever.
      
      1. This causes the queue to no longer receive packets.
      2. If napi_disable() is then called under the protection of
         rtnl_lock, rtnl_lock stays held forever, affecting the whole
         system.
      
      This patch uses cmpxchg to implement napi_enable(), which ensures
      that there is no race due to clearing the two bits separately.
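      
      A sketch of the cmpxchg loop (reconstructed from the description,
      so details may differ from the final patch): SCHED and NPSVC are
      cleared in one atomic state transition, so no other CPU can observe
      the half-cleared state the diagram above exploits.
      
      void napi_enable(struct napi_struct *n)
      {
              unsigned long val, new;
      
              do {
                      BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
      
                      val = READ_ONCE(n->state);
                      /* Clear SCHED and NPSVC together: no window where
                       * only one is visible to napi_schedule_prep(). */
                      new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
              } while (cmpxchg(&n->state, val, new) != val);
      }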
      
      Fixes: 2d8bff12 ("netpoll: Close race condition between poll_one_napi and napi_disable")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 15 Sep 2021 (1 commit)
    • net: sched: update default qdisc visibility after Tx queue cnt changes · 1e080f17
      Authored by Jakub Kicinski
      mq / mqprio make the default child qdiscs visible. They only do
      so for the qdiscs which are within real_num_tx_queues when the
      device is registered. Depending on the order of calls in the driver,
      or if user space changes the config via ethtool -L, the number of
      qdiscs visible under tc qdisc show will differ from the number
      of queues. This is confusing to users and potentially to system
      configuration scripts which try to make sure qdiscs have the
      right parameters.
      
      Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
      
      Note that this uncovers the "shortcut" created by
      commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues").
      The default child qdiscs beyond the initial real_num_tx are always
      pfifo_fast, no matter what the sysfs setting is. Fixing this
      gets a little tricky because we'd need to keep a reference
      on whatever the default qdisc was at the time of creation.
      In practice this is likely a non-issue: the qdiscs likely have
      to be configured to non-default settings anyway, so whatever user
      space is doing such configuration can replace the pfifos... now
      that it will see them.
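      
      A sketch of the shape of such a callback (the hook name
      change_real_num_tx and the mq body below are my reconstruction;
      treat them as illustrative): when real_num_tx_queues moves, child
      qdiscs that fell out of range are unhashed from the visible qdisc
      hash, and ones newly in range are hashed.
      
      /* Illustrative: hide/expose mq's per-queue children as the real
       * Tx queue count changes. Assumes a Qdisc_ops hook along the lines
       * of: void (*change_real_num_tx)(struct Qdisc *, unsigned int). */
      static void mq_change_real_num_tx(struct Qdisc *sch,
                                        unsigned int new_real_tx)
      {
              struct net_device *dev = qdisc_dev(sch);
              struct Qdisc *qdisc;
              unsigned int i;
      
              for (i = new_real_tx; i < dev->real_num_tx_queues; i++) {
                      qdisc = netdev_get_tx_queue(dev, i)->qdisc_sleeping;
                      /* Only default children we attached are hashed. */
                      if (qdisc != &noop_qdisc)
                              qdisc_hash_del(qdisc);
              }
              for (i = dev->real_num_tx_queues; i < new_real_tx; i++) {
                      qdisc = netdev_get_tx_queue(dev, i)->qdisc_sleeping;
                      if (qdisc != &noop_qdisc)
                              qdisc_hash_add(qdisc, false);
              }
      }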
      Reported-by: Matthew Massey <matthewmassey@fb.com>
      Reviewed-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 14 Aug 2021 (1 commit)
  7. 10 Aug 2021 (2 commits)
  8. 05 Aug 2021 (2 commits)
  9. 04 Aug 2021 (1 commit)
    • net: add netif_set_real_num_queues() for device reconfig · 271e5b7d
      Authored by Jakub Kicinski
      netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
      can fail, which breaks drivers trying to implement reconfiguration
      in a way that can't leave the device half-broken. In other words,
      those functions are incompatible with a prepare/commit approach.
      
      Luckily setting the real number of queues can fail only if the
      number is increased, meaning that if we order the operations
      correctly we can guarantee ending up with either the new config
      (success) or the old one (on error).
      
      Provide a helper implementing such logic so that drivers don't
      have to duplicate it.
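      
      A sketch of the ordering logic described above (reconstructed from
      the description; the upstream helper may differ in detail): do the
      increases first, because only increases can fail, and unwind the RX
      increase if the TX increase then fails.
      
      int netif_set_real_num_queues(struct net_device *dev,
                                    unsigned int txq, unsigned int rxq)
      {
              unsigned int old_rxq = dev->real_num_rx_queues;
              int err;
      
              if (txq < 1 || txq > dev->num_tx_queues ||
                  rxq < 1 || rxq > dev->num_rx_queues)
                      return -EINVAL;
      
              /* Start from the increases, so the error path only has to
               * do decreases - and decreases can't fail. */
              if (rxq > dev->real_num_rx_queues) {
                      err = netif_set_real_num_rx_queues(dev, rxq);
                      if (err)
                              return err;
              }
              if (txq > dev->real_num_tx_queues) {
                      err = netif_set_real_num_tx_queues(dev, txq);
                      if (err)
                              goto undo_rx;
              }
              if (rxq < dev->real_num_rx_queues)
                      netif_set_real_num_rx_queues(dev, rxq);
              if (txq < dev->real_num_tx_queues)
                      netif_set_real_num_tx_queues(dev, txq);
              return 0;
      
      undo_rx:
              WARN_ON(netif_set_real_num_rx_queues(dev, old_rxq));
              return err;
      }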
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 03 Aug 2021 (1 commit)
  11. 31 Jul 2021 (1 commit)
  12. 30 Jul 2021 (1 commit)
  13. 29 Jul 2021 (2 commits)
    • skbuff: allow 'slow_gro' for skb carring sock reference · 5e10da53
      Authored by Paolo Abeni
      This change leverages the infrastructure introduced by the previous
      patches to allow soft devices to pass owned skbs to the GRO engine
      without impacting the fast path.
      
      It's up to the GRO caller to ensure the slow_gro bit is valid before
      invoking the GRO engine. The new helper skb_prepare_for_gro() is
      introduced for that purpose.
      
      On slow_gro, skbs are aggregated only with equal sk.
      Additionally, skb truesize on GRO recycle and free is correctly
      updated so that sk wmem is not changed by the GRO processing.
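      
      A deliberately hypothetical sketch of the "aggregated only with
      equal sk" rule (the helper name and placement are mine, not the
      commit's):
      
      /* Hypothetical illustration: in the slow_gro path, a held packet
       * and a new packet may only be merged when their socket owners are
       * identical (including the case where both are NULL). */
      static bool slow_gro_socks_match(const struct sk_buff *p,
                                       const struct sk_buff *skb)
      {
              return p->sk == skb->sk;
      }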
      
      rfc -> v1:
       - fixed bad truesize on dev_gro_receive NAPI_FREE
       - use the existing state bit
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: optimize GRO for the common case. · 9efb4b5b
      Authored by Paolo Abeni
      After the previous patches, at GRO time, skb->slow_gro is
      usually 0, unless the packet comes from some H/W offload
      slowpath or a tunnel.
      
      We can optimize the GRO code assuming !skb->slow_gro is likely.
      This removes multiple conditionals in the most common path, at the
      price of an additional one when we hit the above slow paths.
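      
      The pattern folds several rarely-true tests behind one
      predicted-not-taken branch. A sketch of what a cleanup site looks
      like after the change (approximate, reconstructed from the
      description):
      
      /* Common case: slow_gro is 0 and all of this is skipped with a
       * single unlikely() branch instead of several separate tests. */
      if (unlikely(skb->slow_gro)) {
              skb_orphan(skb);        /* drop the sock reference, if any */
              skb_ext_reset(skb);     /* drop skb extensions */
              nf_reset_ct(skb);       /* drop the conntrack reference */
              skb->slow_gro = 0;
      }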
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 20 Jul 2021 (1 commit)
  15. 16 Jul 2021 (1 commit)
    • net_sched: introduce tracepoint trace_qdisc_enqueue() · 70713ddd
      Authored by Qitao Xu
      Tracepoint trace_qdisc_enqueue() is introduced to trace skbs at
      the entrance of the TC layer on the TX side. It is similar to
      trace_qdisc_dequeue():
      
      1. For both we only trace successful cases. The failure cases
         can be traced via trace_kfree_skb().
      
      2. They are called at the entrance or exit of the TC layer, not for
         each ->enqueue() or ->dequeue(). This is intentional, because
         we want to make trace_qdisc_enqueue() symmetric to
         trace_qdisc_dequeue(), which is easier to use.
      
      The return value of qdisc_enqueue() is not interesting here: qdiscs
      can also drop packets in ->dequeue(), and it is impossible to trace
      those even if we have the return value; the only way to trace them
      is to trace kfree_skb().
      
      We only add the information we need to the trace ring buffer. If any
      other information is needed, it is easy to extend without breaking ABI,
      see commit 3dd344ea ("net: tracepoint: exposing sk_family in all
      tcp:tracepoints").
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 13 Jul 2021 (1 commit)
    • xdp, net: Fix use-after-free in bpf_xdp_link_release · 5acc7d3e
      Authored by Xuan Zhuo
      The problem occurs in the window between dev_get_by_index() and
      dev_xdp_attach_link(). If dev_xdp_uninstall() runs in that window,
      the xdp link will not be detached automatically when the dev is
      released. But link->dev already points to the dev, so when the xdp
      link is later released the dev is still accessed, although it has
      already been freed.
      
      dev_get_by_index()        |
      link->dev = dev           |
                                |      rtnl_lock()
                                |      unregister_netdevice_many()
                                |          dev_xdp_uninstall()
                                |      rtnl_unlock()
      rtnl_lock();              |
      dev_xdp_attach_link()     |
      rtnl_unlock();            |
                                |      netdev_run_todo() // dev released
      bpf_xdp_link_release()    |
          /* access dev.        |
             use-after-free */  |
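      
      The fix closes the window by doing the device lookup and the link
      attach inside a single rtnl_lock critical section, roughly as below
      (simplified from my reading of the patch; error paths trimmed):
      
      rtnl_lock();
      dev = dev_get_by_index(net, attr->link_create.target_ifindex);
      if (!dev) {
              rtnl_unlock();
              return -EINVAL;
      }
      /* ... allocate and prime the bpf_xdp_link ... */
      err = dev_xdp_attach_link(dev, &extack, link);
      rtnl_unlock();
      /* unregister_netdevice_many() can no longer run between the lookup
       * and the attach, so dev_xdp_uninstall() always sees and detaches
       * this link before the dev is released. */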
      
      [   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
      [   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
      [   45.968297]
      [   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
      [   45.969222] Hardware name: linux,dummy-virt (DT)
      [   45.969795] Call trace:
      [   45.970106]  dump_backtrace+0x0/0x4c8
      [   45.970564]  show_stack+0x30/0x40
      [   45.970981]  dump_stack_lvl+0x120/0x18c
      [   45.971470]  print_address_description.constprop.0+0x74/0x30c
      [   45.972182]  kasan_report+0x1e8/0x200
      [   45.972659]  __asan_report_load8_noabort+0x2c/0x50
      [   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.973834]  bpf_link_free+0xd0/0x188
      [   45.974315]  bpf_link_put+0x1d0/0x218
      [   45.974790]  bpf_link_release+0x3c/0x58
      [   45.975291]  __fput+0x20c/0x7e8
      [   45.975706]  ____fput+0x24/0x30
      [   45.976117]  task_work_run+0x104/0x258
      [   45.976609]  do_notify_resume+0x894/0xaf8
      [   45.977121]  work_pending+0xc/0x328
      [   45.977575]
      [   45.977775] The buggy address belongs to the page:
      [   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
      [   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
      [   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
      [   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      [   45.982259] page dumped because: kasan: bad access detected
      [   45.982948]
      [   45.983153] Memory state around the buggy address:
      [   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.986419]                                               ^
      [   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988895] ==================================================================
      [   45.989773] Disabling lock debugging due to kernel taint
      [   45.990552] Kernel panic - not syncing: panic_on_warn set ...
      [   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
      [   45.991929] Hardware name: linux,dummy-virt (DT)
      [   45.992448] Call trace:
      [   45.992753]  dump_backtrace+0x0/0x4c8
      [   45.993208]  show_stack+0x30/0x40
      [   45.993627]  dump_stack_lvl+0x120/0x18c
      [   45.994113]  dump_stack+0x1c/0x34
      [   45.994530]  panic+0x3a4/0x7d8
      [   45.994930]  end_report+0x194/0x198
      [   45.995380]  kasan_report+0x134/0x200
      [   45.995850]  __asan_report_load8_noabort+0x2c/0x50
      [   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.997007]  bpf_link_free+0xd0/0x188
      [   45.997474]  bpf_link_put+0x1d0/0x218
      [   45.997942]  bpf_link_release+0x3c/0x58
      [   45.998429]  __fput+0x20c/0x7e8
      [   45.998833]  ____fput+0x24/0x30
      [   45.999247]  task_work_run+0x104/0x258
      [   45.999731]  do_notify_resume+0x894/0xaf8
      [   46.000236]  work_pending+0xc/0x328
      [   46.000697] SMP: stopping secondary CPUs
      [   46.001226] Dumping ftrace buffer:
      [   46.001663]    (ftrace buffer empty)
      [   46.002110] Kernel Offset: disabled
      [   46.002545] CPU features: 0x00000001,23202c00
      [   46.003080] Memory Limit: none
      
      Fixes: aa8d3a71 ("bpf, xdp: Add bpf_link-based XDP attachment API")
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
  17. 10 Jul 2021 (1 commit)
    • net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache · 28b34f01
      Authored by Antoine Tenart
      Some socket buffers allocated in the fclone cache (in __alloc_skb)
      can end up in the following path[1]:
      
      napi_skb_finish
        __kfree_skb_defer
          napi_skb_cache_put
      
      The issue is that napi_skb_cache_put is not fclone friendly and will
      put those skbuffs in the skb cache to be reused later, although this
      cache only expects skbuffs allocated from skbuff_head_cache. When
      this happens the skbuff is eventually freed using the wrong origin
      cache, and we can see traces similar to:
      
      [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
      [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
      [ 1223.950211] Modules linked in:
      [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
      [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
      [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0
      
      This sometimes leads to other memory-related issues.
      
      Fix this by using __kfree_skb for fclone skbuffs, similar to what is
      done at the other place where __kfree_skb_defer is called.
      
      [1] At least in setups using veth pairs and tunnels. Building a kernel
          with KASAN we can for example see packets allocated in
          sk_stream_alloc_skb hit the above path and later the issue arises
          when the skbuff is reused.
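      
      A sketch of the fix in napi_skb_finish() (reconstructed from the
      description; the exact upstream code may differ): route fclone skbs
      to __kfree_skb() instead of the NAPI skb cache.
      
      case GRO_MERGED_FREE:
              if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
                      napi_skb_free_stolen_head(skb);
              else if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
                      /* fclone skb: free to its own cache, never to the
                       * NAPI skb cache, which assumes skbuff_head_cache. */
                      __kfree_skb(skb);
              else
                      __kfree_skb_defer(skb);
              break;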
      
      Fixes: 9243adfc ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
      Cc: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 08 Jul 2021 (4 commits)
  19. 07 Jul 2021 (1 commit)
  20. 29 Jun 2021 (1 commit)
  21. 24 Jun 2021 (1 commit)
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Authored by Yunsheng Lin
      Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flags set, but queue discipline bypass does not work for lockless
      qdiscs because the skb is always enqueued to the qdisc, even when
      the qdisc is empty; see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver for an empty lockless qdisc, which avoids the enqueue
      and dequeue operations.
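      
      A sketch of the bypass path in __dev_xmit_skb() (approximate;
      statistics and error handling trimmed). Note that emptiness is
      retested under q->seqlock to guard against a concurrent requeue:
      
      if (q->flags & TCQ_F_NOLOCK) {
              if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
                  qdisc_run_begin(q)) {
                      /* Retest emptiness under q->seqlock: a concurrent
                       * requeue may have refilled the qdisc. */
                      if (unlikely(!nolock_qdisc_is_empty(q))) {
                              rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                              __qdisc_run(q);
                              qdisc_run_end(q);
                              goto no_lock_out;
                      }
      
                      qdisc_bstats_cpu_update(q, skb);
                      /* Hand the skb straight to the driver; only run the
                       * qdisc if someone queued behind us meanwhile. */
                      if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
                          !nolock_qdisc_is_empty(q))
                              __qdisc_run(q);
      
                      qdisc_run_end(q);
                      return NET_XMIT_SUCCESS;
              }
              /* ... normal lockless enqueue path ... */
      }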
      
      qdisc->empty is not reliable for indicating an empty qdisc because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicates there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid packet reordering
      problems.
      
      In order to make the MISSED state reliable for indicating an empty
      qdisc, we need to ensure that testing and clearing of the MISSED
      state happen within the protection of qdisc->seqlock; only setting
      the MISSED state can be done without that protection. A MISSED
      state test is added without the protection of qdisc->seqlock to
      avoid doing an unnecessary spin_trylock() in the contended case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When the above happens, skb1 enqueued by thread2 is transmitted
      after skb2 is transmitted by thread3, because the MISSED state
      setting and the enqueuing are not under qdisc->seqlock. If qdisc
      bypass were disabled, skb1 would have a better chance of being
      transmitted before skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following: even if CPU1 and CPU2 write
      skbs to two sockets both heading to the same qdisc at the same
      time, there is no guarantee which skb will hit the qdisc first,
      because many factors like interrupts/softirqs/cache misses/
      scheduling affect that.
      
      There are some cases below that need special handling:
      1. The MISSED state is cleared before another round of dequeuing
         in pfifo_fast_dequeue(), but __qdisc_run() might not be able to
         dequeue all skbs in one round and call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for lockless qdiscs and
         __netif_schedule() is called at the end of qdisc_run_end().
      
      2. The MISSED state also needs to be set for lockless qdiscs,
         instead of calling __netif_schedule() directly, when requeuing
         an skb, for a similar reason.
      
      3. For the netdev-queue-stopped case, the MISSED state needs
         clearing while the netdev queue is stopped, otherwise there may
         be unnecessary __netif_schedule() calls. So a new DRAINING state
         is added to indicate this case, which also indicates a non-empty
         qdisc.
      
      4. There is already a netif_xmit_frozen_or_stopped() check in
         dequeue_skb() and sch_direct_xmit(), both within the protection
         of qdisc->seqlock, but the same check in __dev_xmit_skb() is
         without that protection, which might make the empty indication
         of a lockless qdisc unreliable. So remove the check from
         __dev_xmit_skb(); the checks within the protection of
         qdisc->seqlock seem enough to avoid the CPU consumption problem
         for the netdev-queue-stopped case.
      
      [1] https://lkml.org/lkml/2021/5/29/215
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 22 Jun 2021 (1 commit)
  23. 18 Jun 2021 (1 commit)
  24. 15 May 2021 (2 commits)
    • net: sched: fix tx action reschedule issue with stopped queue · dcad9ee9
      Authored by Yunsheng Lin
      The netdev queue might be stopped when the byte queue limit has
      been reached or the tx hw ring is full. net_tx_action() may still
      be rescheduled if STATE_MISSED is set, which consumes unnecessary
      cpu without dequeuing and transmitting any skb, because the
      netdev queue is stopped; see qdisc_run_end().
      
      This patch fixes it by checking the netdev queue state before
      calling qdisc_run() and clearing STATE_MISSED if the netdev queue
      is stopped during qdisc_run(); net_tx_action() is rescheduled
      again when the netdev queue is restarted, see netif_tx_wake_queue().
      
      As there is a time window between the netif_xmit_frozen_or_stopped()
      check and the STATE_MISSED clearing, during which STATE_MISSED may
      be set by a net_tx_action() scheduled from netif_tx_wake_queue(),
      set STATE_MISSED again if the netdev queue has been restarted.
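      
      A sketch of the clearing logic from the last paragraph (the helper
      name and barrier placement follow my reading of the patch; treat
      the details as approximate):
      
      static void qdisc_maybe_clear_missed(struct Qdisc *q,
                                           const struct netdev_queue *txq)
      {
              clear_bit(__QDISC_STATE_MISSED, &q->state);
      
              /* Make sure the netif_xmit_frozen_or_stopped() check below
               * happens after the clearing of STATE_MISSED. */
              smp_mb__after_atomic();
      
              /* If the queue was restarted in the window above, a
               * rescheduled net_tx_action() may have set STATE_MISSED
               * again; re-set it so no transmission is missed. */
              if (!netif_xmit_frozen_or_stopped(txq))
                      set_bit(__QDISC_STATE_MISSED, &q->state);
      }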
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix tx action rescheduling issue during deactivation · 102b55ee
      Authored by Yunsheng Lin
      Currently qdisc_run() checks the STATE_DEACTIVATED of a lockless
      qdisc before calling __qdisc_run(), which ultimately clears
      STATE_MISSED once all skbs are dequeued. If STATE_DEACTIVATED
      is set before clearing STATE_MISSED, there may be a rescheduling
      of net_tx_action() at the end of qdisc_run_end(), see below:
      
      CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
                .                   .                     .
                .            set STATE_MISSED             .
                .           __netif_schedule()            .
                .                   .           set STATE_DEACTIVATED
                .                   .                qdisc_reset()
                .                   .                     .
                .<---------------   .              synchronize_net()
      clear __QDISC_STATE_SCHED  |  .                     .
                .                |  .                     .
                .                |  .            some_qdisc_is_busy()
                .                |  .               return *false*
                .                |  .                     .
        test STATE_DEACTIVATED   |  .                     .
      __qdisc_run() *not* called |  .                     .
                .                |  .                     .
         test STATE_MISSED       |  .                     .
       __netif_schedule()--------|  .                     .
                .                   .                     .
                .                   .                     .
      
      __qdisc_run() is not called by net_tx_action() in CPU0 because
      CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(),
      and STATE_MISSED is only cleared in __qdisc_run(); __netif_schedule()
      is called at the end of qdisc_run_end(), causing the tx action
      rescheduling problem.
      
      qdisc_run() called by net_tx_action() runs in softirq context,
      which should have the same semantics as the qdisc_run() called by
      __dev_xmit_skb() protected by rcu_read_lock_bh(). And as there is a
      synchronize_net() between the STATE_DEACTIVATED flag being set and
      qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can
      safely bail out for a deactivated lockless qdisc in net_tx_action(),
      and qdisc_reset() will reset all skbs not dequeued yet.
      
      So add the rcu_read_lock() explicitly to protect the qdisc_run()
      and do the STATE_DEACTIVATED checking in net_tx_action() before
      calling qdisc_run_begin(). Another option is to do the checking in
      the qdisc_run_end(), but it will add unnecessary overhead for
      non-tx_action case, because __dev_queue_xmit() will not see qdisc
      with STATE_DEACTIVATED after synchronize_net(), the qdisc with
      STATE_DEACTIVATED can only be seen by net_tx_action() because of
      __netif_schedule().
      
      The STATE_DEACTIVATED check in qdisc_run() is to avoid a race
      between net_tx_action() and qdisc_reset(), see:
      commit d518d2ed ("net/sched: fix race between deactivation
      and dequeue for NOLOCK qdisc"). As the bailout added above for
      deactivated lockless qdiscs in net_tx_action() provides better
      protection for the race without calling qdisc_run() at all, remove
      the STATE_DEACTIVATED check from qdisc_run().
      
      After qdisc_reset(), there is no skb in the qdisc to be dequeued,
      so clear STATE_MISSED in dev_reset_queue() too.
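      
      A sketch of the bailout in net_tx_action() (approximate
      reconstruction; the root-lock handling of non-NOLOCK qdiscs is
      elided):
      
      rcu_read_lock();
      
      while (head) {
              struct Qdisc *q = head;
      
              head = head->next_sched;
      
              if (q->flags & TCQ_F_NOLOCK &&
                  unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                      /* dev_deactivate() has already done a
                       * synchronize_net() after setting the flag, and
                       * qdisc_reset() drops any skbs still queued, so it
                       * is safe to skip this qdisc entirely. */
                      clear_bit(__QDISC_STATE_SCHED, &q->state);
                      continue;
              }
      
              clear_bit(__QDISC_STATE_SCHED, &q->state);
              qdisc_run(q);
      }
      
      rcu_read_unlock();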
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      V8: Clearing STATE_MISSED before calling __netif_schedule() avoids
          the endless rescheduling problem, but there may still be an
          unnecessary rescheduling, so adjust the commit log.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 14 May 2021 (1 commit)
  26. 23 Apr 2021 (1 commit)
  27. 20 Apr 2021 (1 commit)
  28. 14 Apr 2021 (1 commit)
    • gro: ensure frag0 meets IP header alignment · 38ec4944
      Authored by Eric Dumazet
      After commit 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
      Guenter Roeck reported one failure in his tests using sh architecture.
      
      After much debugging, we have been able to spot silent unaligned
      accesses in inet_gro_receive().
      
      The issue at hand is that upper networking stacks assume their header
      is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
      bytes before the Ethernet header to make that happen.
      
      This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
      if the fragment is not properly aligned.
      
      Some arches like x86, arm64 and powerpc do not care and define
      NET_IP_ALIGN as 0; this extra check will be a NOP for them.
      
      Note that if frag0 is not used, GRO will call pskb_may_pull()
      as many times as needed to pull network and transport headers.
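      
      A sketch of the hardened condition in skb_gro_reset_offset()
      (reconstructed from the description; surrounding code trimmed):
      frag0 is only used when NET_IP_ALIGN is 0 or the fragment start is
      word-aligned.
      
      const struct skb_shared_info *pinfo = skb_shinfo(skb);
      const skb_frag_t *frag0 = &pinfo->frags[0];
      
      if (!skb_headlen(skb) && pinfo->nr_frags &&
          !PageHighMem(skb_frag_page(frag0)) &&
          (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
              /* Safe to take the frag0 fast path: headers read from
               * here will be word-aligned. */
              NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
              NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
                                                  skb_frag_size(frag0),
                                                  skb->end - skb->tail);
      }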
      
      Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
      Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 10 Apr 2021 (1 commit)
  30. 08 Apr 2021 (1 commit)
  31. 06 Apr 2021 (1 commit)
  32. 26 Mar 2021 (1 commit)
  33. 25 Mar 2021 (1 commit)
    • net: resolve forwarding path from virtual netdevice and HW destination address · ddb94eaf
      Authored by Pablo Neira Ayuso
      This patch adds dev_fill_forward_path(), which resolves the path to
      reach the real netdevice from the IP forwarding side. This function
      takes as input the netdevice and the destination hardware address,
      and it walks down the device stack calling .ndo_fill_forward_path()
      for each device until the real device is found.
      
      For instance, assuming the following topology:
      
                     IP forwarding
                    /             \
                 br0              eth0
                 / \
             eth1  eth2
              .
              .
              .
             ethX
       ab:cd:ef:ab:cd:ef
      
      where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
      ethX is the interface in another box which is connected to the eth1
      bridge port.
      
      For packets going through IP forwarding to br0 whose destination MAC
      address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
      following path:
      
      	br0 -> eth1
      
      The .ndo_fill_forward_path callback for br0 looks up the FDB using
      the destination MAC address to get the bridge port eth1.
      
      This information allows creating a fast path that bypasses the
      classic bridge and IP forwarding paths, so packets go directly from
      the bridge port eth1 to eth0 (the WAN interface) and vice versa.
      
                   fast path
            .------------------------.
           /                          \
          |           IP forwarding   |
          |          /             \  \/
          |       br0               eth0
          .       / \
           -> eth1  eth2
              .
              .
              .
             ethX
       ab:cd:ef:ab:cd:ef
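      
      A simplified sketch of the walk dev_fill_forward_path() performs
      (signatures and helper names are approximate and have changed
      across kernel versions; treat this as illustrative only):
      
      int dev_fill_forward_path(const struct net_device *dev, const u8 *daddr,
                                struct net_device_path_stack *stack)
      {
              const struct net_device *last_dev;
              struct net_device_path *path;
              int ret = 0;
      
              stack->num_paths = 0;
              while (dev && dev->netdev_ops->ndo_fill_forward_path) {
                      last_dev = dev;
                      path = dev_fwd_path(stack);     /* next free slot */
                      if (!path)
                              return -1;
      
                      memset(path, 0, sizeof(*path));
                      /* The device fills in one hop and advances @dev to
                       * the lower device (e.g. br0 -> eth1). */
                      ret = dev->netdev_ops->ndo_fill_forward_path(&dev, daddr, path);
                      if (ret < 0)
                              return -1;
                      if (WARN_ON_ONCE(last_dev == dev))
                              return -1;      /* no progress: abort */
              }
      
              /* Terminate the path with the real device. */
              path = dev_fwd_path(stack);
              if (!path)
                      return -1;
              path->type = DEV_PATH_ETHERNET;
              path->dev = dev;
              return ret;
      }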
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>