1. 28 Oct 2021 (1 commit)
    • net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap() · 26746382
      By Seth Forshee
      Currently rcu_barrier() is used to ensure that no readers of the
      inactive mini_Qdisc buffer remain before it is reused. This waits for
      any pending RCU callbacks to complete, when all that is actually
      required is to wait for one RCU grace period to elapse after the buffer
      was made inactive. This means that using rcu_barrier() may result in
      unnecessary waits.
      
      To improve this, store the current RCU state when a buffer is made
      inactive and use poll_state_synchronize_rcu() to check whether a full
      grace period has elapsed before reusing it. If a full grace period has
      not elapsed, wait for a grace period to elapse, and in the non-RT case
      use synchronize_rcu_expedited() to hasten it.
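
      A minimal sketch of the polling pattern described above (the rcu_state
      field and surrounding structure are illustrative, not the commit's exact
      diff; only the RCU APIs are taken as given):

        /* Record the grace-period cookie when a buffer becomes inactive. */
        miniq->rcu_state = get_state_synchronize_rcu();

        /* Before reusing the buffer, wait only if a full grace period has
         * not already elapsed since the buffer was made inactive.
         */
        if (!poll_state_synchronize_rcu(miniq->rcu_state)) {
                if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                        synchronize_rcu_expedited(); /* hasten the wait */
                else
                        synchronize_rcu();
        }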
      
      Since this approach eliminates the RCU callback it is no longer
      necessary to synchronize_rcu() in the tp_head==NULL case. However, the
      RCU state should still be saved for the previously active buffer.
      
      Before this change I would typically see mini_qdisc_pair_swap() take
      tens of milliseconds to complete. After this change it typically
      finishes in less than 1 ms, and often it takes just a few microseconds.
      
      Thanks to Paul for walking me through the options for improving this.
      
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
      Link: https://lore.kernel.org/r/20211026130700.121189-1-seth@forshee.me
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 20 Oct 2021 (2 commits)
  3. 18 Oct 2021 (4 commits)
    • net: sched: Remove Qdisc::running sequence counter · 29cbcd85
      By Ahmed S. Darwish
      The Qdisc::running sequence counter has two uses:
      
        1. Reliably reading qdisc's tc statistics while the qdisc is running
           (a seqcount read/retry loop at gnet_stats_add_basic()).
      
        2. As a flag, indicating whether the qdisc in question is running
           (without any retry loops).
      
      For the first usage, the Qdisc::running sequence counter write section,
      qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
      is actually needed: the raw qdisc's bstats update. A u64_stats sync
      point was thus introduced (in previous commits) inside the bstats
      structure itself. A local u64_stats write section is then started and
      stopped for the bstats updates.
      
      Use that u64_stats sync point mechanism for the bstats read/retry loop
      at gnet_stats_add_basic().
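
      A sketch of the resulting read/retry loop (field names follow the
      gnet_stats_basic_sync layout introduced by the related commits below;
      treat this as illustrative rather than the exact diff):

        unsigned int start;
        u64 bytes, packets;

        do {
                start = u64_stats_fetch_begin(&b->syncp);
                bytes = u64_stats_read(&b->bytes);
                packets = u64_stats_read(&b->packets);
        } while (u64_stats_fetch_retry(&b->syncp, start));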
      
      For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
      accessed with atomic bitops, is sufficient. Using a bit flag instead of
      a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
      to the SMP barriers implicitly added through raw_read_seqcount() and
      write_seqcount_begin/end() getting removed. All call sites have been
      surveyed though, and no required ordering was identified.
      
      Now that the qdisc->running sequence counter is no longer used, remove
      it.
      
      Note, using u64_stats implies no sequence counter protection for 64-bit
      architectures. This can lead to the qdisc tc statistics "packets" vs.
      "bytes" values getting out of sync on rare occasions. The individual
      values will still be valid.
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types · 50dc9a85
      By Ahmed S. Darwish
      The only factor differentiating per-CPU bstats data type (struct
      gnet_stats_basic_cpu) from the packed non-per-CPU one (struct
      gnet_stats_basic_packed) was a u64_stats sync point inside the former.
      The two data types are now equivalent: earlier commits added a u64_stats
      sync point to the latter.
      
      Combine both data types into "struct gnet_stats_basic_sync". This
      eliminates redundancy and simplifies the bstats read/write APIs.
      
      Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit
      architectures, u64_stats sync points do not use sequence counter
      protection.
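
      The merged type looks roughly like this (sketch based on the
      description above):

        struct gnet_stats_basic_sync {
                u64_stats_t bytes;
                u64_stats_t packets;
                struct u64_stats_sync syncp; /* empty on 64-bit architectures */
        } __aligned(2 * sizeof(u64));
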
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: Protect Qdisc::bstats with u64_stats · 67c9e627
      By Ahmed S. Darwish
      The not-per-CPU variant of qdisc tc (traffic control) statistics,
      Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running
      sequence counter.
      
      This sequence counter is used for reliably protecting bstats reads from
      parallel writes. Meanwhile, the seqcount's write section covers a much
      wider area than bstats update: qdisc_run_begin() => qdisc_run_end().
      
      That read/write section asymmetry can lead to needless retries of the
      read section. To prepare for removing the Qdisc::running sequence
      counter altogether, introduce a u64_stats sync point inside bstats
      instead.
      
      Modify _bstats_update() to start/end the bstats u64_stats write
      section.
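
      A sketch of the resulting write section (the struct is still named
      gnet_stats_basic_packed at this point in the series):

        static inline void _bstats_update(struct gnet_stats_basic_packed *bstats,
                                          __u64 bytes, __u32 packets)
        {
                u64_stats_update_begin(&bstats->syncp);
                bstats->bytes += bytes;
                bstats->packets += packets;
                u64_stats_update_end(&bstats->syncp);
        }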
      
      For bisectability, and finer commits granularity, the bstats read
      section is still protected with a Qdisc::running read/retry loop and
      qdisc_run_begin/end() still starts/ends that seqcount write section.
      Once all call sites are modified to use _bstats_update(), the
      Qdisc::running seqcount will be removed and bstats read/retry loop will
      be modified to utilize the internal u64_stats sync point.
      
      Note, using u64_stats implies no sequence counter protection for 64-bit
      architectures. This can lead to the statistics "packets" vs. "bytes"
      values getting out of sync on rare occasions. The individual values will
      still be valid.
      
      [bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.]
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gen_stats: Move remaining users to gnet_stats_add_queue(). · 10940eb7
      By Sebastian Andrzej Siewior
      The gnet_stats_queue::qlen member is only used in the SMP case.
      
      qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to
      have the same value as that provided by qdisc_qlen_sum().
      
      gnet_stats_copy_queue() needs to overwrite the resulting qstats.qlen
      field with the caller-submitted qlen value, since the aggregate might
      differ from the submitted value.
      
      Let both functions use gnet_stats_add_queue() and remove unused
      __gnet_stats_copy_queue().
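
      A hedged sketch of the converted helper (the exact signatures live in
      the kernel tree; this only illustrates the shape of the change):

        static void qdisc_qstats_qlen_backlog(struct Qdisc *sch, __u32 *qlen,
                                              __u32 *backlog)
        {
                struct gnet_stats_queue qstats = { 0 };

                gnet_stats_add_queue(&qstats, sch->cpu_qstats, &sch->qstats);
                *qlen = qstats.qlen + qdisc_qlen(sch); /* add non-per-CPU qlen */
                *backlog = qstats.backlog;
        }
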
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 19 Sep 2021 (1 commit)
  5. 15 Sep 2021 (1 commit)
    • net: sched: update default qdisc visibility after Tx queue cnt changes · 1e080f17
      By Jakub Kicinski
      mq / mqprio make the default child qdiscs visible. They only do
      so for the qdiscs which are within real_num_tx_queues when the
      device is registered. Depending on the order of calls in the driver,
      or if user space changes the config via ethtool -L, the number of
      qdiscs visible under tc qdisc show will differ from the number
      of queues. This is confusing to users and potentially to system
      configuration scripts which try to make sure qdiscs have the
      right parameters.
      
      Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
      
      Note that this uncovers the "shortcut" created by
      commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues").
      The default child qdiscs beyond the initial real_num_tx are always
      pfifo_fast, no matter what the sysfs setting is. Fixing this
      gets a little tricky because we'd need to keep a reference
      on whatever the default qdisc was at the time of creation.
      In practice this is likely a non-issue: the qdiscs likely have
      to be configured to non-default settings, so whatever user space
      is doing such configuration can replace the pfifos... now that
      it will see them.
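
      A sketch of the new callback (the name change_real_num_tx follows the
      mainline commit; the struct excerpt is illustrative):

        struct Qdisc_ops {
                /* ... existing ops ... */
                void    (*change_real_num_tx)(struct Qdisc *sch,
                                              unsigned int new_real_tx);
        };

      mq/mqprio would implement it to hash or unhash their child qdiscs
      whenever the real TX queue count changes.
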
      Reported-by: Matthew Massey <matthewmassey@fb.com>
      Reviewed-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 02 Aug 2021 (1 commit)
    • net_sched: refactor TC action init API · 695176bf
      By Cong Wang
      The TC action ->init() API has 10 parameters, which makes it harder
      to read. Some of them are just booleans and can be replaced
      by flags; similarly for the internal APIs tcf_action_init()
      and tcf_exts_validate().
      
      This patch converts them to flags and folds them into
      the upper 16 bits of "flags", whose lower 16 bits are still
      reserved for user-space. More specifically, the following
      kernel flags are introduced:
      
      TCA_ACT_FLAGS_POLICE replaces 'name' in a few contexts, to
      distinguish whether it is compatible with the policer.
      
      TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
      this action is bound to a filter.
      
      TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
      meaning we are replacing an existing action.
      
      TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
      opposite meaning, because we still hold RTNL in most
      cases.
      
      The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
      untouched and still stored as before.
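
      The resulting layout, roughly (a sketch that mirrors the description
      above; exact values may differ from the merged patch):

        #define TCA_ACT_FLAGS_USER_BITS 16
        #define TCA_ACT_FLAGS_USER_MASK (~0U >> TCA_ACT_FLAGS_USER_BITS)
        #define TCA_ACT_FLAGS_POLICE    (1U << TCA_ACT_FLAGS_USER_BITS)
        #define TCA_ACT_FLAGS_BIND      (1U << (TCA_ACT_FLAGS_USER_BITS + 1))
        #define TCA_ACT_FLAGS_REPLACE   (1U << (TCA_ACT_FLAGS_USER_BITS + 2))
        #define TCA_ACT_FLAGS_NO_RTNL   (1U << (TCA_ACT_FLAGS_USER_BITS + 3))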
      
      I have tested this patch with tdc and I do not see any
      failure related to this patch.
      Tested-by: Vlad Buslov <vladbu@nvidia.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 24 Jun 2021 (3 commits)
    • net: sched: remove qdisc->empty for lockless qdisc · d3e0f575
      By Yunsheng Lin
      As the MISSED and DRAINING states are used to indicate a non-empty
      qdisc, qdisc->empty is no longer needed, so remove it.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      By Yunsheng Lin
      Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flags set, but queue discipline bypass does not work for lockless
      qdiscs because the skb is always enqueued to the qdisc even when the
      qdisc is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver for an empty lockless qdisc, which avoids the
      enqueue and dequeue operations.
      
      qdisc->empty is not reliable for indicating an empty qdisc because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicates there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid packet out-of-order
      problems.
      
      In order to make the MISSED state reliable for indicating an empty
      qdisc, we need to ensure that testing and clearing of the MISSED
      state happen within the protection of qdisc->seqlock; only setting
      the MISSED state can be done without that protection. A MISSED state
      test is added outside the protection of qdisc->seqlock to avoid
      doing an unnecessary spin_trylock() in the contended case.
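
      A condensed sketch of the bypass path in __dev_xmit_skb()
      (illustrative, not the exact diff; nolock_qdisc_is_empty() reports
      empty only when neither MISSED nor DRAINING is set):

        if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
            qdisc_run_begin(q)) {
                /* Re-check emptiness under q->seqlock: a requeue may
                 * have raced in between the first check and the lock.
                 */
                if (likely(nolock_qdisc_is_empty(q))) {
                        qdisc_bstats_cpu_update(q, skb);
                        /* Straight to the driver: no enqueue/dequeue. */
                        if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
                            !nolock_qdisc_is_empty(q))
                                __qdisc_run(q);
                        qdisc_run_end(q);
                        return NET_XMIT_SUCCESS;
                }
                qdisc_run_end(q); /* fall back to the enqueue path */
        }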
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When the above happens, skb1 enqueued by thread2 is transmitted after
      skb2 is transmitted by thread3 because the MISSED state setting and
      the enqueuing are not under qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has a better chance to be transmitted quicker than
      skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following: even if CPU1 and CPU2 write
      skbs to two sockets both heading to the same qdisc at the same
      time, there is no guarantee which skb will hit the qdisc first,
      because many factors like interrupts/softirqs/cache misses/
      scheduling affect that.
      
      The following cases need special handling:
      1. The MISSED state may be cleared before another round of dequeuing
         in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
         dequeue all skbs in one round and call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for lockless qdiscs and
         __netif_schedule() is called at the end of qdisc_run_end().

      2. The MISSED state also needs to be set for lockless qdiscs instead
         of calling __netif_schedule() directly when requeuing an skb, for
         a similar reason.

      3. For the netdev-queue-stopped case, the MISSED state needs
         clearing while the netdev queue is stopped, otherwise there may
         be unnecessary __netif_schedule() calls. So a new DRAINING state
         is added to indicate this case, which also indicates a non-empty
         qdisc.

      4. There is already a netif_xmit_frozen_or_stopped() check in
         dequeue_skb() and sch_direct_xmit(), both within the protection
         of qdisc->seqlock, but the same check in __dev_xmit_skb() is
         without that protection, which might make the empty indication
         of a lockless qdisc unreliable. So remove the check in
         __dev_xmit_skb(); the checks under the protection of
         qdisc->seqlock seem enough to avoid the CPU consumption problem
         for the netdev-queue-stopped case.
      
      1. https://lkml.org/lkml/2021/5/29/215

      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: avoid unnecessary seqcount operation for lockless qdisc · dd25296a
      By Yunsheng Lin
      The qdisc->running seqcount operation is mainly used to do heuristic
      locking on q->busylock for locked qdiscs, see qdisc_is_running()
      and __dev_xmit_skb().

      So avoid the seqcount operation for qdiscs with the TCQ_F_NOLOCK
      flag set.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 Jun 2021 (1 commit)
    • net: sched: add barrier to ensure correct ordering for lockless qdisc · 89837eb4
      By Yunsheng Lin
      The spin_trylock() was assumed to contain the implicit
      barrier needed to ensure the correct ordering between
      STATE_MISSED setting/clearing and STATE_MISSED checking
      in commit a90c57f2 ("net: sched: fix packet stuck
      problem for lockless qdisc").
      
      But it turns out that spin_trylock() only has load-acquire
      semantics. On strongly ordered systems (like x86), the compiler
      barrier implicitly contained in spin_trylock() seems enough
      to ensure the correct ordering. But on weakly ordered systems
      (like arm64), store-release semantics are needed to ensure
      the correct ordering, as clear_bit() and test_bit() are store
      operations; see queued_spin_lock().
      
      So add the explicit barrier to ensure the correct ordering
      for the above case.
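
      The pattern of the fix, sketched (the placement inside
      qdisc_run_begin() is illustrative):

        set_bit(__QDISC_STATE_MISSED, &qdisc->state);

        /* spin_trylock() is only load-acquire; force the MISSED store
         * to be visible before the retry on weakly ordered machines.
         */
        smp_mb__after_atomic();

        return spin_trylock(&qdisc->seqlock);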
      
      Fixes: a90c57f2 ("net: sched: fix packet stuck problem for lockless qdisc")
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 15 May 2021 (1 commit)
    • net: sched: fix packet stuck problem for lockless qdisc · a90c57f2
      By Yunsheng Lin
      Lockless qdisc has the following concurrency problem:
          cpu0                 cpu1
           .                     .
      q->enqueue                 .
           .                     .
      qdisc_run_begin()          .
           .                     .
      dequeue_skb()              .
           .                     .
      sch_direct_xmit()          .
           .                     .
           .                q->enqueue
           .             qdisc_run_begin()
           .            return and do nothing
           .                     .
      qdisc_run_end()            .
      
      cpu1 enqueues an skb without calling __qdisc_run() because cpu0
      has not released the lock yet and spin_trylock() returns false
      for cpu1 in qdisc_run_begin(); and cpu0 does not see the skb
      enqueued by cpu1 when calling dequeue_skb() because cpu1 may
      enqueue the skb after cpu0 calls dequeue_skb() and before
      cpu0 calls qdisc_run_end().
      
      Lockless qdisc has another concurrency problem when
      tx_action is involved:
      
      cpu0(serving tx_action)     cpu1             cpu2
                .                   .                .
                .              q->enqueue            .
                .            qdisc_run_begin()       .
                .              dequeue_skb()         .
                .                   .            q->enqueue
                .                   .                .
                .             sch_direct_xmit()      .
                .                   .         qdisc_run_begin()
                .                   .       return and do nothing
                .                   .                .
       clear __QDISC_STATE_SCHED    .                .
       qdisc_run_begin()            .                .
       return and do nothing        .                .
                .                   .                .
                .            qdisc_run_end()         .
      
      This patch fixes the above data races by:
      1. If the first spin_trylock() returns false and STATE_MISSED is
         not set, setting STATE_MISSED and retrying another
         spin_trylock(), in case the other CPU may not see STATE_MISSED
         after it releases the lock.
      2. Rescheduling if STATE_MISSED is set after the lock is released
         at the end of qdisc_run_end().
      
      For the tx_action case, STATE_MISSED is also set when cpu1 is at the
      end of qdisc_run_end(), so tx_action will be rescheduled again
      to dequeue the skb enqueued by cpu2.
      
      Clear STATE_MISSED before retrying a dequeue when the dequeue
      returns NULL, in order to reduce the overhead of the second
      spin_trylock() and of calling __netif_schedule().

      Also clear STATE_MISSED before calling __netif_schedule()
      at the end of qdisc_run_end() to avoid doing another round of
      dequeuing in pfifo_fast_dequeue().
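
      Sketches of the two halves of the fix (illustrative excerpts, not
      the exact diff):

        /* qdisc_run_begin(), lockless case: */
        if (!spin_trylock(&qdisc->seqlock)) {
                if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
                        return false; /* someone else will reschedule */
                set_bit(__QDISC_STATE_MISSED, &qdisc->state);
                /* Retry: the owner may have released before seeing MISSED. */
                return spin_trylock(&qdisc->seqlock);
        }
        return true;

        /* qdisc_run_end(), lockless case: */
        spin_unlock(&qdisc->seqlock);
        if (unlikely(test_bit(__QDISC_STATE_MISSED, &qdisc->state))) {
                clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
                __netif_schedule(qdisc); /* pick up skbs we may have missed */
        }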
      
      The performance impact of this patch, tested using pktgen and
      dummy netdev with pfifo_fast qdisc attached:
      
       threads  without+this_patch   with+this_patch      delta
          1        2.61Mpps            2.60Mpps           -0.3%
          2        3.97Mpps            3.82Mpps           -3.7%
          4        5.62Mpps            5.59Mpps           -0.5%
          8        2.78Mpps            2.77Mpps           -0.3%
         16        2.22Mpps            2.22Mpps           -0.0%
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 14 Mar 2021 (1 commit)
  11. 02 Feb 2021 (1 commit)
    • net: sched: replaced invalid qdisc tree flush helper in qdisc_replace · 938e0fcd
      By Alexander Ovechkin
      Commit e5f0e8f8 ("net: sched: introduce and use qdisc tree flush/purge helpers")
      introduced qdisc tree flush/purge helpers, but erroneously used the flush helper
      instead of the purge helper in the qdisc_replace function.
      This issue was found by our CI, which tests various qdisc setups by configuring
      qdiscs and sending data through them. The call to the invalid helper sporadically
      leads to corruption of the vt_tree/cf_tree of hfsc_class, which causes a kernel oops:
      
       Oops: 0000 [#1] SMP PTI
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-8f6859df #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
       RIP: 0010:rb_insert_color+0x18/0x190
       Code: c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 07 48 85 c0 0f 84 05 01 00 00 48 8b 10 f6 c2 01 0f 85 34 01 00 00 <48> 8b 4a 08 49 89 d0 48 39 c1 74 7d 48 85 c9 74 32 f6 01 01 75 2d
       RSP: 0018:ffffc900000b8bb0 EFLAGS: 00010246
       RAX: ffff8881ef4c38b0 RBX: ffff8881d956e400 RCX: ffff8881ef4c38b0
       RDX: 0000000000000000 RSI: ffff8881d956f0a8 RDI: ffff8881d956e4b0
       RBP: 0000000000000000 R08: 000000d5c4e249da R09: 1600000000000000
       R10: ffffc900000b8be0 R11: ffffc900000b8b28 R12: 0000000000000001
       R13: 000000000000005a R14: ffff8881f0905000 R15: ffff8881f0387d00
       FS:  0000000000000000(0000) GS:ffff8881f8b00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 00000001f4796004 CR4: 0000000000060ee0
       Call Trace:
        <IRQ>
        init_vf.isra.19+0xec/0x250 [sch_hfsc]
        hfsc_enqueue+0x245/0x300 [sch_hfsc]
        ? fib_rules_lookup+0x12a/0x1d0
        ? __dev_queue_xmit+0x4b6/0x930
        ? hfsc_delete_class+0x250/0x250 [sch_hfsc]
        __dev_queue_xmit+0x4b6/0x930
        ? ip6_finish_output2+0x24d/0x590
        ip6_finish_output2+0x24d/0x590
        ? ip6_output+0x6c/0x130
        ip6_output+0x6c/0x130
        ? __ip6_finish_output+0x110/0x110
        mld_sendpack+0x224/0x230
        mld_ifc_timer_expire+0x186/0x2c0
        ? igmp6_group_dropped+0x200/0x200
        call_timer_fn+0x2d/0x150
        run_timer_softirq+0x20c/0x480
        ? tick_sched_do_timer+0x60/0x60
        ? tick_sched_timer+0x37/0x70
        __do_softirq+0xf7/0x2cb
        irq_exit+0xa0/0xb0
        smp_apic_timer_interrupt+0x74/0x150
        apic_timer_interrupt+0xf/0x20
        </IRQ>
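
      For reference, a sketch of the distinction between the two helpers
      (bodies as assumed from the flush/purge series; purge also resets the
      qdisc and propagates the qlen change up the tree, which is what
      qdisc_replace needs):

        static inline void qdisc_tree_flush_backlog(struct Qdisc *sch)
        {
                __u32 qlen, backlog;

                qdisc_qstats_qlen_backlog(sch, &qlen, &backlog);
                qdisc_tree_reduce_backlog(sch, 0, backlog);
        }

        static inline void qdisc_purge_queue(struct Qdisc *sch)
        {
                __u32 qlen, backlog;

                qdisc_qstats_qlen_backlog(sch, &qlen, &backlog);
                qdisc_reset(sch);
                qdisc_tree_reduce_backlog(sch, qlen, backlog);
        }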
      
      Fixes: e5f0e8f8 ("net: sched: introduce and use qdisc tree flush/purge helpers")
      Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
      Reported-by: Alexander Kuznetsov <wwfq@yandex-team.ru>
      Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Link: https://lore.kernel.org/r/20210201200049.299153-1-ovov@yandex-team.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  12. 23 Jan 2021 (2 commits)
  13. 21 Jan 2021 (1 commit)
  14. 28 Nov 2020 (2 commits)
  15. 03 Nov 2020 (1 commit)
    • net: sched: Remove broken definitions and un-hide for !LOCKDEP · a72e9d54
      By Jakub Kicinski
      Currently, variables used only within lockdep expressions are flagged as
      unused, requiring that these variables' declarations be decorated with
      either #ifdef or __maybe_unused.  This results in ugly code.  This commit
      therefore causes the full definitions of the lockdep_tcf_chain_is_locked()
      and lockdep_tcf_proto_is_locked() functions to be visible even when
      lockdep is not enabled, thus removing the need for the previous empty
      functions that were provided in non-lockdep kernels.  This approach
      further relies on dead-code elimination to remove any references to
      functions or variables that are not available in non-lockdep kernels.
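
      A sketch of the now-unconditional definition (this assumes the
      companion lockdep change that makes lockdep_is_held() evaluate to a
      constant true when CONFIG_LOCKDEP is off, so dead-code elimination
      removes the rest):

        static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
        {
                return lockdep_is_held(&chain->filter_chain_lock);
        }
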
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      --
      CC: jhs@mojatatu.com
      CC: xiyou.wangcong@gmail.com
      CC: jiri@resnulli.us
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  16. 09 Oct 2020 (1 commit)
  17. 18 Sep 2020 (1 commit)
  18. 04 Aug 2020 (1 commit)
    • net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct · 038ebb1a
      By wenxu
      When openvswitch conntrack is offloaded with the act_ct action,
      fragmented packets are defragmented in the ingress tc act_ct action
      and may miss the next chain. The packet is then passed to the
      openvswitch datapath without the mru, and the over-MTU packet is
      dropped by the output action in openvswitch:
      
      "kernel: net2: dropped over-mtu packet: 1528 > 1500"
      
      This patch adds the mru to tc_skb_ext for the defrag-and-miss-next-chain
      situation, and also adds the mru to qdisc_skb_cb. act_ct sets the mru
      in qdisc_skb_cb when the packet is defragmented, and when the chain
      misses, the mru is set in tc_skb_ext where the ovs datapath can
      retrieve it.
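
      A sketch of the added field (illustrative excerpt; qdisc_skb_cb grows
      a matching mru member that act_ct fills in after defragmentation):

        struct tc_skb_ext {
                __u32 chain;
                __u16 mru; /* defrag MRU, handed to the ovs datapath on miss */
        };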
      
      Fixes: b57dc7c1 ("net/sched: Introduce action ct")
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 17 Jul 2020 (1 commit)
  20. 30 Jun 2020 (1 commit)
    • net: sched: Pass root lock to Qdisc_ops.enqueue · aebe4426
      By Petr Machata
      A following patch introduces qevents, points in a qdisc algorithm where
      a packet can be processed by user-defined filters. Should this processing
      lead to a situation where a new packet is to be enqueued on the same port,
      holding the root lock would lead to deadlocks. To solve the issue, the
      qevent handler needs to unlock and relock the root lock when necessary.
      
      To that end, add the root lock argument to the qdisc op enqueue, and
      propagate throughout.
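
      The op's new shape, roughly (a sketch of the signature the patch
      propagates; surrounding struct excerpt is illustrative):

        struct Qdisc_ops {
                /* ... */
                int     (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
                                   spinlock_t *root_lock,
                                   struct sk_buff **to_free);
                /* ... */
        };
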
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 16 May 2020 (1 commit)
    • net: sched: introduce terse dump flag · f8ab1807
      By Vlad Buslov
      Add a new TCA_DUMP_FLAGS attribute and use it in the cls API to request
      terse filter output from classifiers with the TCA_DUMP_FLAGS_TERSE flag.
      This option is intended to improve the performance of TC filter dump when
      userland only needs to obtain stats and not the whole classifier/action
      data. Extend struct tcf_proto_ops with a new terse_dump() callback that
      must be defined by supporting classifier implementations.
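
      A sketch of the new callback in struct tcf_proto_ops (signature per
      the description above; treat the details as illustrative):

        int     (*terse_dump)(struct net *net, struct tcf_proto *tp,
                              void *fh, struct sk_buff *skb,
                              struct tcmsg *t, bool rtnl_held);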
      
      Support for the option in specific classifiers and actions is
      implemented in the following patches of the series.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 05 May 2020 (1 commit)
    • net_sched: fix tcm_parent in tc filter dump · a7df4870
      By Cong Wang
      When we tell the kernel to dump filters from root (ffff:ffff),
      those filters on ingress (ffff:0000) are matched, but their
      true parents must be dumped as they are. However, the kernel
      dumps just whatever we tell it, that is, either ffff:ffff
      or ffff:0000:
      
       $ nl-cls-list --dev=dummy0 --parent=root
       cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
       $ nl-cls-list --dev=dummy0 --parent=ffff:
       cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all
      
      This is confusing and misleading, more importantly this is
      a regression since 4.15, so the old behavior must be restored.
      
      And, when tc filters are installed on a tc class, the parent
      should be the classid rather than the qdisc handle. Commit
      edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      removed the classid we saved for filters; we can just restore
      this classid in tcf_block.
      
      Steps to reproduce this:
       ip li set dev dummy0 up
       tc qd add dev dummy0 ingress
       tc filter add dev dummy0 parent ffff: protocol arp basic action pass
       tc filter show dev dummy0 root
      
      Before this patch:
       filter protocol arp pref 49152 basic
       filter protocol arp pref 49152 basic handle 0x1
      	action order 1: gact action pass
      	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      After this patch:
       filter parent ffff: protocol arp pref 49152 basic
       filter parent ffff: protocol arp pref 49152 basic handle 0x1
       	action order 1: gact action pass
       	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      Fixes: a10fa201 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
      Fixes: edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 25 Apr 2020 (1 commit)
  24. 26 Mar 2020 (1 commit)
    • net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build · 2c64605b
      By Pablo Neira Ayuso
      net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
          net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
            pkt->skb->tc_redirected = 1;
                    ^~
          net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
            pkt->skb->tc_from_ingress = 1;
                    ^~
      
      To avoid a direct dependency on tc actions from netfilter, wrap the
      redirect bits around CONFIG_NET_REDIRECT and move the helpers to
      include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
      only existing client of these bits in the tree.

      This patch adds skb_set_redirected(), which sets the redirected bit
      on the skbuff; it specifies whether the packet was redirected from
      ingress, and resets the timestamp (the timestamp reset was originally
      missing in the netfilter bugfix).
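
      A sketch of the added helper (close to, but not guaranteed to match,
      the mainline version):

        static inline void skb_set_redirected(struct sk_buff *skb,
                                              bool from_ingress)
        {
                skb->redirected = 1;
        #ifdef CONFIG_NET_REDIRECT
                skb->from_ingress = from_ingress;
                if (skb->from_ingress)
                        skb->tstamp = 0; /* the reset the netfilter fix missed */
        #endif
        }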
      
      Fixes: bcfabee1 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
      Reported-by: noreply@ellerman.id.au
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 13 Mar 2020 (1 commit)
    • Revert "net: sched: make newly activated qdiscs visible" · 7c4046b1
      By Julian Wiedmann
      This reverts commit 4cda7527
      from net-next.
      
      Brown bag time.
      
      Michal noticed that this change doesn't work at all when
      netif_set_real_num_tx_queues() gets called prior to an initial
      dev_activate(), as for instance igb does.
      
      Doing so dies with:
      
      [   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
      [   40.586922] #PF: supervisor read access in kernel mode
      [   40.592668] #PF: error_code(0x0000) - not-present page
      [   40.598405] PGD 0 P4D 0
      [   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
      [   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
      [   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
      [   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
      [   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
      [   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
      [   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
      [   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
      [   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
      [   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
      [   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
      [   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
      [   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
      [   40.723156] Call Trace:
      [   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
      [   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
      [   40.737286]  __igb_open+0x19a/0x5d0 [igb]
      [   40.741767]  __dev_open+0xbb/0x150
      [   40.745567]  __dev_change_flags+0x157/0x1a0
      [   40.750240]  dev_change_flags+0x23/0x60
      
      [...]
      
      Fixes: 4cda7527 ("net: sched: make newly activated qdiscs visible")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 12 Mar 2020 (1 commit)
    • net: sched: make newly activated qdiscs visible · 4cda7527
      By Julian Wiedmann
      In their .attach callback, mq[prio] only add the qdiscs of the currently
      active TX queues to the device's qdisc hash list.
      If a user later increases the number of active TX queues, their qdiscs
      are not visible via e.g. 'tc qdisc show'.
      
      Add a hook to netif_set_real_num_tx_queues() that walks all active
      TX queues and adds those which are missing to the hash list.
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 20 Feb 2020 (1 commit)
  28. 27 Jan 2020 (1 commit)
    • net_sched: fix ops->bind_class() implementations · 2e24cd75
      By Cong Wang
      The current implementations of ops->bind_class() merely
      search for the classid and update the class in the struct tcf_result,
      without invoking either cl_ops->bind_tcf() or
      cl_ops->unbind_tcf(). This breaks their design, as qdiscs
      like cbq use them to count filters too. This is why syzbot triggered
      the warning in cbq_destroy_class().
      
      In order to fix this, we have to call cl_ops->bind_tcf() and
      cl_ops->unbind_tcf() like the filter binding path. This patch does
      so by refactoring out two helper functions __tcf_bind_filter()
      and __tcf_unbind_filter(), which are lockless and accept a Qdisc
      pointer, then teaching each implementation to call them correctly.
      
      Note, we merely pass the Qdisc pointer as an opaque pointer to
      each filter; they only need to pass it down to the helper
      functions without understanding it at all.
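
      A sketch of one of the refactored helpers (assumed shape;
      cls_set_class() swaps the stored class pointer):

        static inline void
        __tcf_bind_filter(struct Qdisc *q, struct tcf_result *r,
                          unsigned long base)
        {
                unsigned long cl;

                cl = q->ops->cl_ops->bind_tcf(q, base, r->classid);
                cl = cls_set_class(q, &r->class, cl);
                if (cl)
                        q->ops->cl_ops->unbind_tcf(q, cl);
        }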
      
      Fixes: 07d79fc7 ("net_sched: add reverse binding for tc class")
      Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 31 Dec 2019 (1 commit)
    • net/sched: add delete_empty() to filters and use it in cls_flower · a5b72a08
      By Davide Caratti
      Revert "net/sched: cls_u32: fix refcount leak in the error path of
      u32_change()", and fix the u32 refcount leak in a more generic way that
      preserves the semantic of rule dumping.
      On tc filters that don't support lockless insertion/removal, there is no
      need to guard against concurrent insertion when a removal is in progress.
      Therefore, for most of them we can avoid a full walk() when deleting, and
      just decrease the refcount, like it was done on older Linux kernels.
      This fixes situations where walk() was wrongly detecting a non-empty
      filter, like it happened with cls_u32 in the error path of change(), thus
      leading to failures in the following tdc selftests:
      
       6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
       6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
       74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
      
      On cls_flower, and on (future) lockless filters, this check is necessary:
      move all the check_empty() logic into a callback so that each filter
      can have its own implementation. For cls_flower, it's sufficient to check
      if no IDRs have been allocated.
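
      A sketch of the flower callback (illustrative; empty means no handles
      left in the IDR):

        static bool fl_delete_empty(struct tcf_proto *tp)
        {
                struct cls_fl_head *head = fl_head_dereference(tp);

                spin_lock(&tp->lock);
                tp->deleting = idr_is_empty(&head->handle_idr);
                spin_unlock(&tp->lock);

                return tp->deleting; /* true iff the filter can go away */
        }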
      
      This reverts commit 275c44aa.
      
      Changes since v1:
       - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
         is used, thanks to Vlad Buslov
       - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
       - squash revert and new fix in a single patch, to be nice with bisect
         tests that run tdc on u32 filter, thanks to Dave Miller
      
      Fixes: 275c44aa ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
      Fixes: 6676d5e4 ("net: sched: set dedicated tcf_walker flag when tp is empty")
      Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Suggested-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
      Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 09 Nov 2019 (1 commit)
    • net/sched: annotate lockless accesses to qdisc->empty · 90b2be27
      By Eric Dumazet
      KCSAN reported the following race [1]
      
      BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action
      
      read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1:
       __dev_xmit_skb net/core/dev.c:3389 [inline]
       __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0:
       qdisc_run_begin include/net/sch_generic.h:160 [inline]
       qdisc_run include/net/pkt_sched.h:120 [inline]
       net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
       local_bh_enable include/linux/bottom_half.h:32 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
       ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
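
      The annotation pattern the fix applies, sketched (the access sites are
      illustrative): READ_ONCE()/WRITE_ONCE() document the intentional data
      race for KCSAN and keep the compiler from tearing or fusing the
      accesses.

        /* reader, __dev_xmit_skb(): */
        if (q->flags & TCQ_F_CAN_BYPASS && READ_ONCE(q->empty) &&
            qdisc_run_begin(q)) {
                /* ... bypass path ... */
        }

        /* writers, e.g. qdisc_run_begin() / pfifo_fast_dequeue(): */
        WRITE_ONCE(qdisc->empty, false);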
      
      Fixes: d518d2ed ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 06 Nov 2019 (1 commit)
    • net: sched: prevent duplicate flower rules from tcf_proto destroy race · 59eb87cb
      By John Hurley
      When a new filter is added to the cls API, the function
      tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
      determine if the tcf_proto is duplicated in the chain's hashtable. It then
      creates a new entry or continues with an existing one. In cls_flower, this
      allows the function fl_ht_insert_unique() to determine if a filter is a
      duplicate and reject it appropriately, meaning that the duplicate will not
      be passed to drivers via the offload hooks. However, when a tcf_proto is
      destroyed it is removed from its chain before a hardware remove hook is
      hit. This can lead to a race whereby the driver has not received the
      remove message but duplicate flows can be accepted. This, in turn, can
      lead to the offload driver receiving incorrect duplicate flows and
      out-of-order add/delete messages.
      
      Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
      hash table per block stores each unique chain/protocol/prio being
      destroyed. This entry is only removed when the full destroy (and hardware
      offload) has completed. If a new flow is being added with the same
      identifiers as a tcf_proto being destroyed, then the add request is
      replayed until the destroy is complete.
      
      Fixes: 8b64678e ("net: sched: refactor tp insert/delete for concurrent execution")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Reported-by: Louis Peens <louis.peens@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 31 Oct 2019 (1 commit)