1. 05 10月, 2014 1 次提交
    • J
      net: sched: suspicious RCU usage in qdisc_watchdog · 1e203c1a
      John Fastabend 提交于
      Suspicious RCU usage in qdisc_watchdog call needs to be done inside
      rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
      need to ensure timer is cancelled before removing qdisc structure.
      
      [ 3992.191339] ===============================
      [ 3992.191340] [ INFO: suspicious RCU usage. ]
      [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
      [ 3992.191345] -------------------------------
      [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
      [ 3992.191348]
      [ 3992.191348] other info that might help us debug this:
      [ 3992.191348]
      [ 3992.191351]
      [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
      [ 3992.191353] no locks held by swapper/1/0.
      [ 3992.191355]
      [ 3992.191355] stack backtrace:
      [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
      [ 3992.191360] Hardware name:                  /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
      [ 3992.191362]  0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
      [ 3992.191366]  ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
      [ 3992.191370]  ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
      [ 3992.191374] Call Trace:
      [ 3992.191376]  <IRQ>  [<ffffffff8178f92c>] dump_stack+0x4e/0x68
      [ 3992.191387]  [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130
      [ 3992.191392]  [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0
      [ 3992.191396]  [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420
      [ 3992.191399]  [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240
      [ 3992.191403]  [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0
      [ 3992.191406]  [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240
      [ 3992.191410]  [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140
      [ 3992.191415]  [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60
      [ 3992.191419]  [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60
      [ 3992.191422]  [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80
      [ 3992.191424]  <EOI>  [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0
      [ 3992.191432]  [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0
      [ 3992.191437]  [<ffffffff815ed567>] cpuidle_enter+0x17/0x20
      [ 3992.191441]  [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0
      [ 3992.191445]  [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30
      [ 3992.191448]  [<ffffffff81033c16>] start_secondary+0x1b6/0x260
      
      Fixes: b26b0d1e ("net: qdisc: use rcu prefix and silence sparse warnings")
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e203c1a
  2. 30 9月, 2014 4 次提交
  3. 26 9月, 2014 1 次提交
    • E
      net: sched: use pinned timers · 4a8e320c
      Eric Dumazet 提交于
      While using a MQ + NETEM setup, I had confirmation that the default
      timer migration ( /proc/sys/kernel/timer_migration ) is killing us.
      
      Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
      queues) :
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that timers get migrated into a single cpu, presumably idle
      at the time timers are set up.
      Then all qdisc dequeues run from this cpu and huge lock contention
      happens. This single cpu is stuck in softirq mode and cannot dequeue
      fast enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning qdisc timers on the cpu running the qdisc, we respect proper
      XPS setting and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current Qdiscs that benefit from this change are :
      
      	netem, cbq, fq, hfsc, tbf, htb.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a8e320c
  4. 14 9月, 2014 2 次提交
  5. 12 6月, 2014 1 次提交
    • F
      net_sched: drr: warn when qdisc is not work conserving · 6e765a00
      Florian Westphal 提交于
      The DRR scheduler requires that items on the active list are work
      conserving, i.e. do not hold on to skbs for throttling purposes, etc.
      Attaching e.g. tbf renders DRR useless because all other classes on the
      active list are delayed as well.
      
      So, warn users that this configuration won't work as expected; we
      already do this in couple of other qdiscs, see e.g.
      
      commit b00355db
      ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler')
      
      The 'const' change is needed to avoid compiler warning ("discards 'const'
      qualifier from pointer target type").
      
      tested with:
      drr_hier() {
              parent=$1
              classes=$2
              for i in  $(seq 1 $classes); do
                      classid=$parent$(printf %x $i)
                      tc class add dev eth0 parent $parent classid $classid drr
      		tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit
              done
      }
      tc qdisc add dev eth0 root handle 1: drr
      drr_hier 1: 32
      tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e765a00
  6. 03 5月, 2014 1 次提交
  7. 25 4月, 2014 1 次提交
  8. 12 3月, 2014 2 次提交
  9. 11 3月, 2014 1 次提交
  10. 01 1月, 2014 1 次提交
  11. 14 12月, 2013 1 次提交
  12. 10 12月, 2013 1 次提交
  13. 09 10月, 2013 1 次提交
  14. 01 9月, 2013 1 次提交
  15. 31 8月, 2013 1 次提交
    • S
      qdisc: allow setting default queuing discipline · 6da7c8fc
      stephen hemminger 提交于
      By default, the pfifo_fast queue discipline has been used by default
      for all devices. But we have better choices now.
      
      This patch allow setting the default queueing discipline with sysctl.
      This allows easy use of better queueing disciplines on all devices
      without having to use tc qdisc scripts. It is intended to allow
      an easy path for distributions to make fq_codel or sfq the default
      qdisc.
      
      This patch also makes pfifo_fast more of a first class qdisc, since
      it is now possible to manually override the default and explicitly
      use pfifo_fast. The behavior for systems who do not use the sysctl
      is unchanged, they still get pfifo_fast
      
      Also removes leftover random # in sysctl net core.
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6da7c8fc
  16. 15 8月, 2013 1 次提交
    • J
      net_sched: restore "linklayer atm" handling · 8a8e3d84
      Jesper Dangaard Brouer 提交于
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is send to the kernel.  No direct parameter were
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7 ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To keep compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also supports future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec, to convey the choosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab is gets too inaccurate, so bad that
      several fields contain the same values, this resembling the ATM
      detect.  Fields even start to contain "0" time to send, e.g. at
      1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
      been more broken than we first realized.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a8e3d84
  17. 08 6月, 2013 1 次提交
  18. 29 3月, 2013 1 次提交
  19. 27 3月, 2013 1 次提交
  20. 22 3月, 2013 1 次提交
  21. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  22. 19 2月, 2013 2 次提交
  23. 13 2月, 2013 1 次提交
  24. 12 12月, 2012 1 次提交
    • E
      pkt_sched: avoid requeues if possible · 1abbe139
      Eric Dumazet 提交于
      With BQL being deployed, we can more likely have following behavior :
      
      We dequeue a packet from qdisc in dequeue_skb(), then we realize target
      tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
      skb into gso_skb for later.
      
      This shows in stats (tc -s qdisc dev eth0) as requeues.
      
      Problem of these requeues is that high priority packets can not be
      dequeued as long as this (possibly low prio and big TSO packet) is not
      removed from gso_skb.
      
      At 1Gbps speed, a full size TSO packet is 500 us of extra latency.
      
      In some cases, we know that all packets dequeued from a qdisc are
      for a particular and known txq :
      
      - If device is non multi queue
      - For all MQ/MQPRIO slave qdiscs
      
      This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
      this capability, so that dequeue_skb() is allowed to dequeue a packet
      only if the associated txq is not stopped.
      
      This indeed reduce latencies for high prio packets (or improve fairness
      with sfq/fq_codel), and almost remove qdisc 'requeues'.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1abbe139
  25. 19 11月, 2012 1 次提交
  26. 22 10月, 2012 1 次提交
  27. 11 9月, 2012 1 次提交
  28. 27 6月, 2012 1 次提交
  29. 16 5月, 2012 1 次提交
  30. 02 4月, 2012 1 次提交
  31. 04 1月, 2012 1 次提交
    • E
      net_sched: qdisc_alloc_handle() can be too slow · fa0f5aa7
      Eric Dumazet 提交于
      When trying to allocate ~32768 qdiscs using autohandle mechanism, we can
      fill the space managed by kernel (handles in [8000-FFFF]:0000 range)
      
      But O(N^2) qdisc_alloc_handle() loops 0x10000 times instead of 0x8000
      
      time tc add qdisc add dev eth0 parent 10:7fff pfifo limit 10
      RTNETLINK answers: Cannot allocate memory
      real    1m54.826s
      user    0m0.000s
      sys     0m0.004s
      
      INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies)
      
      Half number of loops, and add a cond_resched() call.
      We hold rtnl at this point.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa0f5aa7
  32. 06 7月, 2011 1 次提交
  33. 10 6月, 2011 1 次提交
    • G
      rtnetlink: Compute and store minimum ifinfo dump size · c7ac8679
      Greg Rose 提交于
      The message size allocated for rtnl ifinfo dumps was limited to
      a single page.  This is not enough for additional interface info
      available with devices that support SR-IOV and caused a bug in
      which VF info would not be displayed if more than approximately
      40 VFs were created per interface.
      
      Implement a new function pointer for the rtnl_register service that will
      calculate the amount of data required for the ifinfo dump and allocate
      enough data to satisfy the request.
      Signed-off-by: NGreg Rose <gregory.v.rose@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      c7ac8679
  34. 20 5月, 2011 1 次提交