1. 29 Nov, 2017 1 commit
  2. 22 Oct, 2017 1 commit
  3. 17 Oct, 2017 1 commit
  4. 31 Aug, 2017 1 commit
    • sch_cbq: fix null pointer dereferences on init failure · 3501d059
      Authored by Nikolay Aleksandrov
      CBQ can fail in ->init() due to wrong netlink attributes, or simply because
      none were supplied; e.g. if it is set as the default qdisc, TCA_OPTIONS (opt)
      will be NULL when it is activated. The first thing init does is parse opt, so
      it dereferences a null pointer when used as the default qdisc. In addition,
      since an init failure of the default qdisc invokes ->reset(), which cancels
      all timers, we then dereference two more null pointers (timer->base), as the
      timers were never initialized.
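      
      A minimal sketch of the kind of guard this implies (illustrative only; the
      final patch may be shaped differently, and cbq_policy, cbq_undelay and the
      qdisc watchdog are the existing sch_cbq.c pieces this assumes):
      
      	static int cbq_init(struct Qdisc *sch, struct nlattr *opt)
      	{
      		struct cbq_sched_data *q = qdisc_priv(sch);
      		struct nlattr *tb[TCA_CBQ_MAX + 1];
      		int err;
      
      		/* Set up timers before any failure path can reach ->reset(),
      		 * so cancelling them never touches an uninitialized base.
      		 */
      		qdisc_watchdog_init(&q->watchdog, sch);
      		hrtimer_init(&q->delay_timer, CLOCK_MONOTONIC,
      			     HRTIMER_MODE_ABS_PINNED);
      		q->delay_timer.function = cbq_undelay;
      
      		/* As the system default qdisc we are created without attributes. */
      		if (!opt)
      			return -EINVAL;
      
      		err = nla_parse_nested(tb, TCA_CBQ_MAX, opt, cbq_policy, NULL);
      		if (err < 0)
      			return err;
      
      		/* ... rest of the original initialization ... */
      		return 0;
      	}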
      
      To reproduce:
      $ sysctl net.core.default_qdisc=cbq
      $ ip l set ethX up
      
      Crash log of the first null ptr deref:
      [44727.907454] BUG: unable to handle kernel NULL pointer dereference at (null)
      [44727.907600] IP: cbq_init+0x27/0x205
      [44727.907676] PGD 59ff4067
      [44727.907677] P4D 59ff4067
      [44727.907742] PUD 59c70067
      [44727.907807] PMD 0
      [44727.907873]
      [44727.907982] Oops: 0000 [#1] SMP
      [44727.908054] Modules linked in:
      [44727.908126] CPU: 1 PID: 21312 Comm: ip Not tainted 4.13.0-rc6+ #60
      [44727.908235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [44727.908477] task: ffff88005ad42700 task.stack: ffff880037214000
      [44727.908672] RIP: 0010:cbq_init+0x27/0x205
      [44727.908838] RSP: 0018:ffff8800372175f0 EFLAGS: 00010286
      [44727.909018] RAX: ffffffff816c3852 RBX: ffff880058c53800 RCX: 0000000000000000
      [44727.909222] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8800372175f8
      [44727.909427] RBP: ffff880037217650 R08: ffffffff81b0f380 R09: 0000000000000000
      [44727.909631] R10: ffff880037217660 R11: 0000000000000020 R12: ffffffff822a44c0
      [44727.909835] R13: ffff880058b92000 R14: 00000000ffffffff R15: 0000000000000001
      [44727.910040] FS:  00007ff8bc583740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
      [44727.910339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [44727.910525] CR2: 0000000000000000 CR3: 00000000371e5000 CR4: 00000000000406e0
      [44727.910731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [44727.910936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [44727.911141] Call Trace:
      [44727.911291]  ? lockdep_init_map+0xb6/0x1ba
      [44727.911461]  ? qdisc_alloc+0x14e/0x187
      [44727.911626]  qdisc_create_dflt+0x7a/0x94
      [44727.911794]  ? dev_activate+0x129/0x129
      [44727.911959]  attach_one_default_qdisc+0x36/0x63
      [44727.912132]  netdev_for_each_tx_queue+0x3d/0x48
      [44727.912305]  dev_activate+0x4b/0x129
      [44727.912468]  __dev_open+0xe7/0x104
      [44727.912631]  __dev_change_flags+0xc6/0x15c
      [44727.912799]  dev_change_flags+0x25/0x59
      [44727.912966]  do_setlink+0x30c/0xb3f
      [44727.913129]  ? check_chain_key+0xb0/0xfd
      [44727.913294]  ? check_chain_key+0xb0/0xfd
      [44727.913463]  rtnl_newlink+0x3a4/0x729
      [44727.913626]  ? rtnl_newlink+0x117/0x729
      [44727.913801]  ? ns_capable_common+0xd/0xb1
      [44727.913968]  ? ns_capable+0x13/0x15
      [44727.914131]  rtnetlink_rcv_msg+0x188/0x197
      [44727.914300]  ? rcu_read_unlock+0x3e/0x5f
      [44727.914465]  ? rtnl_newlink+0x729/0x729
      [44727.914630]  netlink_rcv_skb+0x6c/0xce
      [44727.914796]  rtnetlink_rcv+0x23/0x2a
      [44727.914956]  netlink_unicast+0x103/0x181
      [44727.915122]  netlink_sendmsg+0x326/0x337
      [44727.915291]  sock_sendmsg_nosec+0x14/0x3f
      [44727.915459]  sock_sendmsg+0x29/0x2e
      [44727.915619]  ___sys_sendmsg+0x209/0x28b
      [44727.915784]  ? do_raw_spin_unlock+0xcd/0xf8
      [44727.915954]  ? _raw_spin_unlock+0x27/0x31
      [44727.916121]  ? __handle_mm_fault+0x651/0xdb1
      [44727.916290]  ? check_chain_key+0xb0/0xfd
      [44727.916461]  __sys_sendmsg+0x45/0x63
      [44727.916626]  ? __sys_sendmsg+0x45/0x63
      [44727.916792]  SyS_sendmsg+0x19/0x1b
      [44727.916950]  entry_SYSCALL_64_fastpath+0x23/0xc2
      [44727.917125] RIP: 0033:0x7ff8bbc96690
      [44727.917286] RSP: 002b:00007ffc360991e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [44727.917579] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007ff8bbc96690
      [44727.917783] RDX: 0000000000000000 RSI: 00007ffc36099230 RDI: 0000000000000003
      [44727.917987] RBP: ffff880037217f98 R08: 0000000000000001 R09: 0000000000000003
      [44727.918190] R10: 00007ffc36098fb0 R11: 0000000000000246 R12: 0000000000000006
      [44727.918393] R13: 000000000066f1a0 R14: 00007ffc360a12e0 R15: 0000000000000000
      [44727.918597]  ? trace_hardirqs_off_caller+0xa7/0xcf
      [44727.918774] Code: 41 5f 5d c3 66 66 66 66 90 55 48 8d 56 04 45 31 c9
      49 c7 c0 80 f3 b0 81 48 89 e5 41 55 41 54 53 48 89 fb 48 8d 7d a8 48 83
      ec 48 <0f> b7 0e be 07 00 00 00 83 e9 04 e8 e6 f7 d8 ff 85 c0 0f 88 bb
      [44727.919332] RIP: cbq_init+0x27/0x205 RSP: ffff8800372175f0
      [44727.919516] CR2: 0000000000000000
      
      Fixes: 0fbbeb1b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 26 Aug, 2017 1 commit
    • net_sched: remove tc class reference counting · 143976ce
      Authored by WANG Cong
      For TC classes, their ->get() and ->put() are always paired, and the
      reference counting is completely useless, because:
      
      1) For class modification and dumping paths, we already hold the RTNL lock,
         so all of these ->get(), ->change(), ->put() calls are atomic.
      
      2) For filter binding/unbinding, we use a different reference counter, and
         those paths hold the RTNL lock too.
      
      3) ->qlen_notify() is special because it is called on the ->enqueue()
         path, but we already hold the qdisc tree lock there, and we hold this
         tree lock when grafting or deleting the class too, so it cannot go away
         or change until we release the tree lock.
      
      Therefore, this patch removes ->get() and ->put(), but:
      
      1) Adds a new ->find() to look up the pointer to a class by classid,
         without taking a refcount.
      
      2) Moves the original destroy-on-last-refcount of a class into ->delete(),
         right after releasing the tree lock. This is fine because the class has
         already been removed from the hash table while holding the lock.
      
      For qdiscs that also use ->put() as ->unbind(), those callbacks are simply
      renamed to reflect this change.
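      
      In code terms, the class lookup goes from a get/put pair to a plain find
      (a sketch using cbq as the example; cbq_class_lookup() is the existing
      helper this assumes, and details of the real patch may differ):
      
      	/* Before: lookup took a reference the caller had to drop via ->put(). */
      	static unsigned long cbq_get(struct Qdisc *sch, u32 classid);
      	static void cbq_put(struct Qdisc *sch, unsigned long arg);
      
      	/* After: a reference-free lookup; the class is protected by RTNL and
      	 * the qdisc tree lock, and the final destroy happens in ->delete()
      	 * once the tree lock is released.
      	 */
      	static unsigned long cbq_find(struct Qdisc *sch, u32 classid)
      	{
      		struct cbq_sched_data *q = qdisc_priv(sch);
      
      		return (unsigned long)cbq_class_lookup(q, classid);
      	}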
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 17 Aug, 2017 1 commit
  7. 16 Aug, 2017 1 commit
  8. 07 Jun, 2017 1 commit
  9. 18 May, 2017 2 commits
  10. 14 Apr, 2017 1 commit
  11. 13 Mar, 2017 1 commit
  12. 11 Feb, 2017 1 commit
  13. 26 Dec, 2016 1 commit
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Authored by Thomas Gleixner
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where separate seconds and nanoseconds parts of a time
      value need to be converted. For anything where the seconds argument is 0,
      this is pointless and can be replaced with a simple assignment.
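      
      A trivial before/after sketch (expires and delay_ns are hypothetical names):
      
      	/* Before: a zero-seconds value still went through ktime_set(). */
      	expires = ktime_set(0, delay_ns);
      
      	/* After: ktime_t is a plain nanosecond count, so assign directly. */
      	expires = delay_ns;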
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  14. 06 Dec, 2016 1 commit
    • net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Authored by Eric Dumazet
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32-bit kernels (WRITE_ONCE() on a 64-bit quantity
         is not supposed to work well).
      
      In this rewrite:
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32-bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
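      
      A sketch of the lock-free read side this describes (the function and field
      names follow the new net_rate_estimator code as best recalled; the scaling
      shift in particular is an assumption):
      
      	bool gen_estimator_read(struct net_rate_estimator __rcu **rate_est,
      				struct gnet_stats_rate_est64 *sample)
      	{
      		struct net_rate_estimator *est;
      		unsigned int seq;
      
      		rcu_read_lock();
      		est = rcu_dereference(*rate_est);
      		if (!est) {
      			rcu_read_unlock();
      			return false;
      		}
      
      		/* seqcount keeps the two 64-bit reads consistent on 32-bit kernels */
      		do {
      			seq = read_seqcount_begin(&est->seq);
      			sample->bps = est->avbps >> 8;
      			sample->pps = est->avpps >> 8;
      		} while (read_seqcount_retry(&est->seq, seq));
      
      		rcu_read_unlock();
      		return true;
      	}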
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 26 Jun, 2016 1 commit
    • net_sched: drop packets after root qdisc lock is released · 520ac30f
      Authored by Eric Dumazet
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, even though one of FQ's goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue() implementations and to the qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
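      
      The resulting pattern, sketched (example_enqueue is a made-up qdisc; the
      to_free handling follows the description above):
      
      	static int example_enqueue(struct sk_buff *skb, struct Qdisc *sch,
      				   struct sk_buff **to_free)
      	{
      		if (unlikely(sch->q.qlen >= sch->limit))
      			/* chains skb onto *to_free instead of freeing it here */
      			return qdisc_drop(skb, sch, to_free);
      
      		return qdisc_enqueue_tail(skb, sch);
      	}
      
      	/* ...and in the caller, once the root qdisc lock has been released: */
      	if (unlikely(to_free))
      		kfree_skb_list(to_free);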
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 11 Jun, 2016 2 commits
    • net_sched: remove generic throttled management · 45f50bed
      Authored by Eric Dumazet
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and a few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one or more packets are stuck in one or more throttled
      HTB classes, an htb dequeue() performs two atomic operations
      to clear/set the __QDISC_STATE_THROTTLED bit, while the root qdisc
      lock is held.
      
      Removing this pair of atomic operations brings me an 8 % performance
      increase on 200 TCP_RR tests, in the presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      qdisc_is_throttled() anymore.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: cbq: remove a flaky use of qdisc_is_throttled() · cca605dd
      Authored by Eric Dumazet
      So far no qdisc ever unset the throttled bit at enqueue() time,
      so CBQ usage of qdisc_is_throttled() was flaky.
      
      Since __QDISC_STATE_THROTTLED set/unset is way too expensive
      considering that only CBQ was eventually caring for this status,
      it would make sense to implement a Qdisc ops ->is_throttled()
      if we find that this is needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 09 Jun, 2016 3 commits
    • sched: remove qdisc->drop · a09ceb0e
      Authored by Florian Westphal
      After removal of TCA_CBQ_OVL_STRATEGY from the cbq scheduler, there are no
      more callers of ->drop() outside of other ->drop() functions, i.e.
      nothing calls them.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cbq: remove TCA_CBQ_POLICE support · dd47c1fa
      Authored by Florian Westphal
      iproute2 doesn't implement any cbq option that results in this attribute
      being sent to the kernel.
      
      To make use of it, a user would have to
      
      - patch iproute2
      - add a class
      - attach a qdisc to the class (default pfifo doesn't work as
        q->handle is 0 and cbq_set_police() is a no-op in this case)
      - re-'add' the same class (tc class change ...) again
      - user must also specify a defmap (e.g. 'split 1:0 defmap 3f'), since
        this 'police' feature relies on its presence
      - the added qdisc must be one of bfifo, pfifo or netem
      
      If all of these conditions are met and _some_ leaf qdiscs, namely
      p/bfifo, netem, plug or tbf, would drop a packet, the kernel calls back
      into cbq, which will attempt to re-queue the skb into a different class
      as indicated by the parent's defmap entry for TC_PRIO_BESTEFFORT.
      
      [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
      
      This feature, which isn't documented or implemented in iproute2,
      and isn't implemented consistently (most qdiscs, like sfq, codel, etc.,
      drop right away instead of attempting this reclassification), is the
      sole reason for the reshape_fail and __parent members in struct Qdisc.
      
      So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
      so userspace knows we don't support it, and then remove the no-longer-needed
      infrastructure in a followup commit.
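      
      The rejection itself is a short check in cbq's attribute handling, along
      these lines (a sketch; the exact placement in cbq_init()/cbq_change_class()
      and any warning message are assumptions):
      
      	if (tb[TCA_CBQ_POLICE]) {
      		/* previously wired up reshape_fail; now explicitly unsupported */
      		return -EOPNOTSUPP;
      	}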
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cbq: remove TCA_CBQ_OVL_STRATEGY support · c3498d34
      Authored by Florian Westphal
      Since the initial revision of cbq in 2004, iproute2 has never implemented
      support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to
      activate the class->drop() call (the TC_CBQ_OVL_DROP strategy value must
      be set by userspace).
      
      David Miller says:
         It seems really safe to kill this thing off, flag an error if someone
         tries to set the attribute, and therefore kill off all of the
         non-default cbq_ovl_*() functions.
      
      A followup commit can then remove all .drop qdisc methods since this
      removed the only caller.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 08 Jun, 2016 1 commit
    • net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Authored by Eric Dumazet
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by the Google BwE host
      agent [1] are problematic at scale:
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only do the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats().
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to take the root qdisc lock.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 01 Mar, 2016 2 commits
  20. 28 Aug, 2015 1 commit
    • net: sched: consolidate tc_classify{,_compat} · 3b3ae880
      Authored by Daniel Borkmann
      For classifiers getting invoked via tc_classify(), we always need an
      extra function call into tc_classify_compat(), as both are being
      exported as symbols and tc_classify() itself doesn't do much except
      handling of reclassifications when tp->classify() returned with
      TC_ACT_RECLASSIFY.
      
      CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
      all others use tc_classify(). When tc actions are being configured
      out in the kernel, tc_classify() effectively does nothing besides
      delegating.
      
      We could spare this layer and consolidate both functions. pktgen on a
      single CPU constantly pushing skbs directly into the netif_receive_skb()
      path, with a dummy classifier attached on the ingress qdisc, improves
      slightly from 22.3 Mpps to 23.1 Mpps.
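      
      After the consolidation there is a single entry point and the old compat
      behaviour is selected by a flag (a sketch of the call-site change; fl and
      res are placeholder variables):
      
      	/* one exported function instead of tc_classify() + tc_classify_compat() */
      	int tc_classify(struct sk_buff *skb, const struct tcf_proto *tp,
      			struct tcf_result *res, bool compat_mode);
      
      	/* former tc_classify_compat() callers (e.g. cbq, atm) now pass true: */
      	result = tc_classify(skb, fl, &res, true);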
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 30 Sep, 2014 4 commits
  22. 26 Sep, 2014 1 commit
    • net: sched: use pinned timers · 4a8e320c
      Authored by Eric Dumazet
      While using an MQ + NETEM setup, I had confirmation that the default
      timer migration (/proc/sys/kernel/timer_migration) is killing us.
      
      Installing this on the receiver side of a TCP_STREAM test (NIC has 8 TX
      queues):
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that timers get migrated into a single cpu, presumably idle
      at the time timers are set up.
      Then all qdisc dequeues run from this cpu and huge lock contention
      happens. This single cpu is stuck in softirq mode and cannot dequeue
      fast enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning qdisc timers on the cpu running the qdisc, we respect proper
      XPS setting and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current qdiscs that benefit from this change are:
      
      	netem, cbq, fq, hfsc, tbf, htb.
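      
      The core of the change is arming the qdisc watchdog hrtimer in pinned mode,
      roughly as follows (a sketch; wd and expires follow the qdisc_watchdog
      naming, and per-qdisc timers such as cbq's delay_timer get the same flag):
      
      	hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
      	...
      	hrtimer_start(&wd->timer, ns_to_ktime(expires), HRTIMER_MODE_ABS_PINNED);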
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 14 Sep, 2014 2 commits
  24. 20 Aug, 2014 2 commits
    • cbq: now_rt removal · 7201c1dd
      Authored by Vasily Averin
      Now q->now_rt is identical to q->now and is not required anymore.
      Signed-off-by: Vasily Averin <vvs@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cbq: incorrectly low bandwidth setting blocks limited traffic · 73d0f37a
      Authored by Vasily Averin
      Mainstream commit f0f6ee1f ("cbq: incorrect processing of high limits")
      has a side effect: if the cbq bandwidth setting is less than the real
      interface throughput, non-limited traffic can delay limited traffic for a
      very long time.
      
      This happens because q->now changes incorrectly in cbq_dequeue():
      in the described scenario L2T is much greater than the real time delay,
      and q->now gets an extra boost for each transmitted packet.
      
      The accumulated boost prevents q->now from being updated, and a blocked
      class can wait a very long time until (q->now >= cl->undertime) becomes
      true again.
      
      To fix the problem, the patch updates q->now on each cbq_update() call.
      The L2T-related pre-modification of q->now was moved into cbq_update().
      
      My testing confirmed that it fixes the problem and did not discover
      any side effects.
      
      Fixes: f0f6ee1f ("cbq: incorrect processing of high limits")
      Signed-off-by: Vasily Averin <vvs@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 14 Mar, 2014 1 commit
  26. 01 Jan, 2014 1 commit
  27. 20 Dec, 2013 1 commit
  28. 11 Dec, 2013 1 commit
  29. 30 Jul, 2013 1 commit
  30. 11 Jun, 2013 1 commit
    • E
      net_sched: add 64bit rate estimators · 45203a3b
      Eric Dumazet 提交于
      struct gnet_stats_rate_est contains u32 fields, so the bytes per second
      field can wrap at 34360Mbit.
      
      Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
      and switch the kernel to use this structure natively.
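      
      The new structure simply widens the two rate fields (a sketch of the uapi
      addition; the original gnet_stats_rate_est keeps its u32 fields):
      
      	struct gnet_stats_rate_est64 {
      		__u64	bps;	/* current byte rate */
      		__u64	pps;	/* current packet rate */
      	};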
      
      This structure is dumped to user space as a new attribute :
      
      TCA_STATS_RATE_EST64
      
      The old tc command will now display the capped bps (34360Mbit) instead
      of wrapped values, while an updated tc command will display the correct
      information.
      
      Old tc command output, after patch :
      
      eric:~# tc -s -d qd sh dev lo
      qdisc pfifo 8001: root refcnt 2 limit 1000p
       Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
       rate 34360Mbit 189696pps backlog 0b 0p requeues 0
      
      This patch carefully reorganizes "struct Qdisc" layout to get optimal
      performance on SMP.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>