1. 18 May 2018 (1 commit)
  2. 17 May 2018 (1 commit)
    • sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers · 32f7b44d
      Authored by Paolo Abeni
      Currently NOLOCK qdiscs pay a measurable overhead to atomically
      manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
      packet in the uncontended scenario with packet rate below the
      line rate: on packet dequeue and on the next, failing dequeue attempt.
      
      This changeset moves the bit manipulation into the qdisc_run_{begin,end}
      helpers, so that the bit is now flipped only once per packet, with
      measurable performance improvement in the uncontended scenario.
      
      This also allows simplifying the qdisc teardown code path - since
      qdisc_is_running() is now effective for each qdisc type - and avoids a
      possible race between qdisc_run() and dev_deactivate_many(), as
      some_qdisc_is_busy() can now properly detect NOLOCK qdiscs being busy
      dequeuing packets.
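
      A sketch, close to the resulting helpers (simplified; kernel context
      assumed), of where the bit flip now lives:

      static inline bool qdisc_run_begin(struct Qdisc *qdisc)
      {
              if (qdisc->flags & TCQ_F_NOLOCK) {
                      /* single atomic flip per packet when uncontended */
                      if (test_and_set_bit(__QDISC_STATE_RUNNING,
                                           &qdisc->state))
                              return false;
              } else if (qdisc_is_running(qdisc)) {
                      return false;
              }
              raw_write_seqcount_begin(&qdisc->running);
              return true;
      }

      static inline void qdisc_run_end(struct Qdisc *qdisc)
      {
              write_seqcount_end(&qdisc->running);
              if (qdisc->flags & TCQ_F_NOLOCK)
                      clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
      }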
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32f7b44d
  3. 27 March 2018 (1 commit)
    • net: sched, fix OOO packets with pfifo_fast · eb82a994
      Authored by John Fastabend
      After the qdisc lock was dropped in pfifo_fast we allow multiple
      enqueue threads and dequeue threads to run in parallel. On the
      enqueue side the skb bit ooo_okay is used to ensure all related
      skbs are enqueued in-order. On the dequeue side though there is
      no similar logic. What we observe is with fewer queues than CPUs
      it is possible to re-order packets when two instances of
      __qdisc_run() are running in parallel. Each thread will dequeue
      an skb, and whichever thread calls the ndo op first gets its skb
      sent on the wire. This doesn't typically happen because
      qdisc_run() is usually triggered by the same core that did the
      enqueue. However, drivers will trigger __netif_schedule()
      when queues are transitioning from stopped to awake using the
      netif_tx_wake_* APIs. When this happens netif_schedule() calls
      qdisc_run() on the same CPU that did the netif_tx_wake_* which
      is usually done in the interrupt completion context. This CPU
      is selected with the irq affinity which is unrelated to the
      enqueue operations.
      
      To resolve this we add a RUNNING bit to the qdisc to ensure
      only a single dequeue per qdisc is running. Enqueue and dequeue
      operations can still run in parallel and also on multi queue
      NICs we can still have a dequeue in-flight per qdisc, which
      is typically per CPU.
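
      An illustrative sketch of the guard (the shape of the idea, not the
      exact hunk): the RUNNING bit serializes dequeues while enqueues keep
      running in parallel.

      static struct sk_buff *nolock_dequeue_one(struct Qdisc *q)
      {
              struct sk_buff *skb;

              /* only one dequeuer per qdisc at a time */
              if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
                      return NULL;

              skb = q->dequeue(q);    /* e.g. pfifo_fast_dequeue() */

              clear_bit(__QDISC_STATE_RUNNING, &q->state);
              return skb;
      }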
      
      Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
      Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@theobroma-systems.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb82a994
  4. 18 March 2018 (1 commit)
    • net: sched: fix uses after free · cce6294c
      Authored by Eric Dumazet
      syzbot reported one use-after-free in pfifo_fast_enqueue() [1]
      
      Issue here is that we cannot reuse the skb after a successful
      skb_array_produce(), since another cpu might have consumed (and freed)
      it already.
      
      I believe a similar problem exists in try_bulk_dequeue_skb_slow()
      in case we put an skb into qdisc_enqueue_skb_bad_txq() for lockless qdisc.
      
      [1]
      BUG: KASAN: use-after-free in qdisc_pkt_len include/net/sch_generic.h:610 [inline]
      BUG: KASAN: use-after-free in qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
      BUG: KASAN: use-after-free in pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
      Read of size 4 at addr ffff8801cede37e8 by task syzkaller717588/5543
      
      CPU: 1 PID: 5543 Comm: syzkaller717588 Not tainted 4.16.0-rc4+ #265
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x24d lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report+0x23c/0x360 mm/kasan/report.c:412
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
       qdisc_pkt_len include/net/sch_generic.h:610 [inline]
       qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
       pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
       __dev_xmit_skb net/core/dev.c:3216 [inline]
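
      The safe pattern, sketched (simplified from the fixed
      pfifo_fast_enqueue(); qdisc_drop_cpu() and the per-cpu stats helpers
      are real kernel symbols): read everything needed from the skb before
      publishing it.

      static int enqueue_sketch(struct sk_buff *skb, struct Qdisc *qdisc,
                                struct skb_array *q, struct sk_buff **to_free)
      {
              unsigned int pkt_len = qdisc_pkt_len(skb);      /* read first */
              int err = skb_array_produce(q, skb);

              if (unlikely(err))
                      return qdisc_drop_cpu(skb, qdisc, to_free);

              qdisc_qstats_cpu_qlen_inc(qdisc);
              /* do NOT dereference skb here; a consumer may have freed it */
              this_cpu_add(qdisc->cpu_qstats->backlog, pkt_len);
              return NET_XMIT_SUCCESS;
      }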
      
      Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+ed43b6903ab968b16f54@syzkaller.appspotmail.com
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cce6294c
  5. 30 January 2018 (2 commits)
    • net_sched: implement ->change_tx_queue_len() for pfifo_fast · 7007ba63
      Authored by Cong Wang
      pfifo_fast used to drop packets based on qdisc_dev(qdisc)->tx_queue_len,
      so we have to resize the skb array when we change tx_queue_len.

      Other qdiscs which read tx_queue_len are fine because they all
      save it to sch->limit or somewhere else in the qdisc during init,
      so they don't have to implement this; still, it is nicer if they do,
      so that users don't have to re-configure the qdisc after changing
      tx_queue_len.
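
      A sketch of what the callback can look like for pfifo_fast (close to
      the actual implementation; error handling trimmed): resize every
      band's skb_array to the new length.

      static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,
                                                unsigned int new_len)
      {
              struct pfifo_fast_priv *priv = qdisc_priv(sch);
              struct skb_array *bands[PFIFO_FAST_BANDS];
              int prio;

              for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
                      bands[prio] = &priv->q[prio];

              return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS,
                                               new_len, GFP_KERNEL);
      }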
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7007ba63
    • net_sched: plug in qdisc ops change_tx_queue_len · 48bfd55e
      Authored by Cong Wang
      Introduce a new qdisc ops ->change_tx_queue_len() so that
      each qdisc can decide how to implement this if it wants.
      Previously we simply read dev->tx_queue_len; after pfifo_fast
      switched to skb array, we need this API to resize the skb array
      when we change dev->tx_queue_len.
      
      To avoid handling race conditions with TX BH, we need to
      deactivate all TX queues before changing the value and bring them
      back after we are done; this also makes the implementation easier.
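
      A simplified sketch of the deactivate/change/reactivate sequence (the
      kernel version walks every TX queue and handles errors more carefully):

      int dev_qdisc_change_tx_queue_len(struct net_device *dev)
      {
              bool up = dev->flags & IFF_UP;
              unsigned int i;
              int ret = 0;

              if (up)
                      dev_deactivate(dev);    /* quiesce TX BH first */

              for (i = 0; i < dev->num_tx_queues; i++) {
                      struct Qdisc *qdisc = dev->_tx[i].qdisc_sleeping;

                      if (qdisc->ops->change_tx_queue_len) {
                              ret = qdisc->ops->change_tx_queue_len(qdisc,
                                              dev->tx_queue_len);
                              if (ret)
                                      break;
                      }
              }

              if (up)
                      dev_activate(dev);
              return ret;
      }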
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48bfd55e
  6. 23 January 2018 (1 commit)
  7. 17 January 2018 (1 commit)
    • net, sched: fix panic when updating miniq {b,q}stats · 81d947e2
      Authored by Daniel Borkmann
      While working on fixing another bug, I ran into the following panic
      on arm64 by simply attaching clsact qdisc, adding a filter and running
      traffic on ingress to it:
      
        [...]
        [  178.188591] Unable to handle kernel read from unreadable memory at virtual address 810fb501f000
        [  178.197314] Mem abort info:
        [  178.200121]   ESR = 0x96000004
        [  178.203168]   Exception class = DABT (current EL), IL = 32 bits
        [  178.209095]   SET = 0, FnV = 0
        [  178.212157]   EA = 0, S1PTW = 0
        [  178.215288] Data abort info:
        [  178.218175]   ISV = 0, ISS = 0x00000004
        [  178.222019]   CM = 0, WnR = 0
        [  178.224997] user pgtable: 4k pages, 48-bit VAs, pgd = 0000000023cb3f33
        [  178.231531] [0000810fb501f000] *pgd=0000000000000000
        [  178.236508] Internal error: Oops: 96000004 [#1] SMP
        [...]
        [  178.311855] CPU: 73 PID: 2497 Comm: ping Tainted: G        W        4.15.0-rc7+ #5
        [  178.319413] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  178.326887] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  178.331685] pc : __netif_receive_skb_core+0x49c/0xac8
        [  178.336728] lr : __netif_receive_skb+0x28/0x78
        [  178.341161] sp : ffff00002344b750
        [  178.344465] x29: ffff00002344b750 x28: ffff810fbdfd0580
        [  178.349769] x27: 0000000000000000 x26: ffff000009378000
        [...]
        [  178.418715] x1 : 0000000000000054 x0 : 0000000000000000
        [  178.424020] Process ping (pid: 2497, stack limit = 0x000000009f0a3ff4)
        [  178.430537] Call trace:
        [  178.432976]  __netif_receive_skb_core+0x49c/0xac8
        [  178.437670]  __netif_receive_skb+0x28/0x78
        [  178.441757]  process_backlog+0x9c/0x160
        [  178.445584]  net_rx_action+0x2f8/0x3f0
        [...]
      
      Reason is that sch_ingress and sch_clsact are doing mini_qdisc_pair_init()
      which sets up miniq pointers to cpu_{b,q}stats from the underlying qdisc.
      Problem is that this cannot work, since the per-cpu {b,q}stats are
      actually allocated only right after the qdisc ->init() callback in
      qdisc_create(), so the first packet going into sch_handle_ingress()
      calls mini_qdisc_bstats_cpu_update() through an invalid pointer and
      we therefore panic.
      
      In order to fix this, allocation of {b,q}stats needs to happen before we
      call into ->init(). In net-next, there's already such option through commit
      d59f5ffa ("net: sched: a dflt qdisc may be used with per cpu stats").
      However, the bug needs to be fixed in net still for 4.15. Thus, include
      these bits to reduce any merge churn and reuse the static_flags field to
      set TCQ_F_CPUSTATS, and remove the allocation from qdisc_create() since
      there is no other user left. Prashant Bhole ran into the same issue but
      for net-next, thus adding him below as well as co-author. Same issue was
      also reported by Sandipan Das when using bcc.
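
      A condensed sketch of the mechanism (the flag and allocation calls are
      the kernel's; this is not the full patch): the ops advertise
      TCQ_F_CPUSTATS up front, and qdisc_alloc() allocates the stats before
      ->init() runs.

      static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
              .id             = "ingress",
              .static_flags   = TCQ_F_CPUSTATS,
              /* ... */
      };

      /* inside qdisc_alloc(), before ops->init() is invoked: */
      sch->flags = ops->static_flags;
      if (sch->flags & TCQ_F_CPUSTATS) {
              sch->cpu_bstats =
                      netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
              sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
      }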
      
      Fixes: 46209401 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath")
      Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2018-January/001190.html
      Reported-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Co-authored-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Co-authored-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81d947e2
  8. 03 January 2018 (1 commit)
    • net: sched: fix skb leak in dev_requeue_skb() · 9540d977
      Authored by Wei Yongjun
      When dev_requeue_skb() is called with a bulked skb list, only the
      first skb of the list is requeued to the qdisc layer, and the
      others leak without being freed.

      TCP is broken by this skb leak, since the leaked (never freed) skbs
      are considered still in the host queue and are never retransmitted.
      This happens when dev_requeue_skb() is called from qdisc_restart():
        qdisc_restart
        |-- dequeue_skb
        |-- sch_direct_xmit()
            |-- dev_requeue_skb() <-- skb may be bulked
      
      Fix dev_requeue_skb() to requeue the full bulked list. Also change
      __dev_requeue_skb() to use __skb_queue_tail(), to avoid sending
      skbs out of order.
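
      A sketch of the fixed requeue loop (close to the resulting
      __dev_requeue_skb() for this path): walk the whole bulked list and
      append each skb with __skb_queue_tail() to preserve ordering.

      static inline int dev_requeue_skb_sketch(struct sk_buff *skb,
                                               struct Qdisc *q)
      {
              while (skb) {
                      struct sk_buff *next = skb->next;

                      __skb_queue_tail(&q->gso_skb, skb); /* tail keeps order */
                      q->qstats.requeues++;
                      qdisc_qstats_backlog_inc(q, skb);
                      q->q.qlen++;    /* it is still part of the queue */
                      skb = next;
              }
              __netif_schedule(q);
              return 0;
      }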
      
      Fixes: a53851e2 ("net: sched: explicit locking in gso_cpu fallback")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9540d977
  9. 27 December 2017 (1 commit)
  10. 22 December 2017 (3 commits)
  11. 20 December 2017 (2 commits)
  12. 09 December 2017 (9 commits)
  13. 07 December 2017 (1 commit)
  14. 03 November 2017 (1 commit)
  15. 28 October 2017 (1 commit)
  16. 18 October 2017 (1 commit)
    • net: sched: Convert timers to use timer_setup() · cdeabbb8
      Authored by Kees Cook
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly. Add pointer back to Qdisc.
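
      The general shape of such a conversion (illustrative; my_sched_data
      and its timer field are hypothetical placeholders):

      static void sched_timer_cb(struct timer_list *t)
      {
              /* recover the containing structure from the timer pointer */
              struct my_sched_data *q = from_timer(q, t, timer);

              /* ... operate on q ... */
      }

      /* at init, instead of setup_timer(&q->timer, cb, (unsigned long)q): */
      timer_setup(&q->timer, sched_timer_cb, 0);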
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cdeabbb8
  17. 22 September 2017 (1 commit)
  18. 20 September 2017 (1 commit)
  19. 25 August 2017 (1 commit)
    • net_sched: fix a refcount_t issue with noop_qdisc · 551143d8
      Authored by Eric Dumazet
      syzkaller reported a refcount_t warning [1]
      
      Issue here is that noop_qdisc refcnt was never really considered as
      a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set:

      if (qdisc->flags & TCQ_F_BUILTIN ||
          !refcount_dec_and_test(&qdisc->refcnt))
              return;
      
      Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
      really needed, but harmless until refcount_t came.
      
      To fix this problem, we simply need to not increment noop_qdisc.refcnt,
      since we never decrement it.
      
      [1]
      refcount_t: increment on 0; use-after-free.
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:180
       __warn+0x1c4/0x1d9 kernel/panic.c:541
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
       do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
       invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
      RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
      RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282
      RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000
      RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8
      RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0
      R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000
       attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
       dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
       __dev_open+0x227/0x330 net/core/dev.c:1380
       __dev_change_flags+0x695/0x990 net/core/dev.c:6726
       dev_change_flags+0x88/0x140 net/core/dev.c:6792
       dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
       dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
       sock_do_ioctl+0x94/0xb0 net/socket.c:968
       sock_ioctl+0x2c2/0x440 net/socket.c:1058
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
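
      The resulting helper is small (this is close to the actual fix):

      static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
      {
              if (qdisc->flags & TCQ_F_BUILTIN)
                      return;
              refcount_inc(&qdisc->refcnt);
      }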
      
      Fixes: 7b936405 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Reshetova, Elena <elena.reshetova@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      551143d8
  20. 17 August 2017 (1 commit)
    • qdisc: add tracepoint qdisc:qdisc_dequeue for dequeued SKBs · e543002f
      Authored by Jesper Dangaard Brouer
      The main purpose of this tracepoint is to monitor bulk dequeue
      in the network qdisc layer, as it cannot be deduced from the
      existing qdisc stats.
      
      The txq_state can be used for determining the reason for zero packet
      dequeues, see enum netdev_queue_state_t.
      
      Notice that not all packets necessarily activate this tracepoint:
      qdiscs with the TCQ_F_CAN_BYPASS flag can directly invoke
      sch_direct_xmit() when qdisc_qlen is zero.
      
      Remember that perf record supports filters like:
      
       perf record -e qdisc:qdisc_dequeue \
        --filter 'ifindex == 4 && (packets > 1 || txq_state > 0)'
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e543002f
  21. 05 July 2017 (1 commit)
  22. 07 April 2017 (1 commit)
  23. 13 March 2017 (1 commit)
  24. 30 December 2016 (1 commit)
    • net: dev_weight: TX/RX orthogonality · 3d48b53f
      Authored by Matthias Tafelmeier
      Often, introducing side effects on packet processing in the other half
      of the stack by adjusting one of TX/RX via sysctl is not desirable.
      There are cases that demand asymmetric, orthogonal configurability.

      This holds true especially for nodes where RPS (for RFS usage on top)
      is configured, and which therefore use the 'old dev_weight'. This is
      quite a common base configuration nowadays, even with NICs of superior
      processing support (e.g. aRFS).
      
      A good example use case is a node acting as a NoSQL database, with a
      large number of tiny requests and rather few but large packets as
      responses. It is affordable to have a large budget and rx dev_weight
      for the requests. But as a side effect, having this large a number of
      packets processed in one TX run can overwhelm drivers.
      
      This patch therefore introduces independent TX and RX configurability,
      exposed to userland via sysctl.
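
      A sketch of the resulting sysctl handler (close to the actual
      proc_do_dev_weight(); the bias knobs scale the shared dev_weight into
      independent per-direction budgets):

      static int proc_do_dev_weight(struct ctl_table *table, int write,
                                    void __user *buffer, size_t *lenp,
                                    loff_t *ppos)
      {
              int ret;

              ret = proc_dointvec(table, write, buffer, lenp, ppos);
              if (ret != 0)
                      return ret;

              /* derive the RX and TX budgets separately */
              dev_rx_weight = weight_p * dev_weight_rx_bias;
              dev_tx_weight = weight_p * dev_weight_tx_bias;

              return ret;
      }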
      Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d48b53f
  25. 06 December 2016 (1 commit)
    • net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Authored by Eric Dumazet
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernels (WRITE_ONCE() on a 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels (see the sketch after this list).
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
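
      The lockless read side, sketched (field names follow the description
      above and are close to gen_estimator_read()):

      bool read_rate_estimator(struct net_rate_estimator __rcu **rate_est,
                               struct gnet_stats_rate_est64 *sample)
      {
              struct net_rate_estimator *est;
              unsigned int seq;

              rcu_read_lock();
              est = rcu_dereference(*rate_est);
              if (!est) {
                      rcu_read_unlock();
                      return false;
              }
              /* seqcount retry loop gives 32bit kernels a consistent pair */
              do {
                      seq = read_seqcount_begin(&est->seq);
                      sample->bps = est->avbps >> 8;
                      sample->pps = est->avpps >> 8;
              } while (read_seqcount_retry(&est->seq, seq));
              rcu_read_unlock();
              return true;
      }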
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c0d32fd
  26. 19 September 2016 (3 commits)