  1. 12 Jun, 2013 1 commit
  2. 11 Jun, 2013 1 commit
    • net_sched: add 64bit rate estimators · 45203a3b
      Eric Dumazet authored
      struct gnet_stats_rate_est contains u32 fields, so the bytes per second
      field can wrap at 34360Mbit.
      
      Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
      and switch the kernel to use this structure natively.
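
      As a sketch, the new structure only needs the two 64-bit fields the
      commit describes (the layout shown here is illustrative):

      struct gnet_stats_rate_est64 {
              __u64   bps;    /* bytes per second */
              __u64   pps;    /* packets per second */
      };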
      
      This structure is dumped to user space as a new attribute:
      
      TCA_STATS_RATE_EST64
      
      The old tc command will now display bps capped at 34360Mbit instead
      of wrapped values, while an updated tc command will display the
      correct information.
      
      Old tc command output, after the patch:
      
      eric:~# tc -s -d qd sh dev lo
      qdisc pfifo 8001: root refcnt 2 limit 1000p
       Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
       rate 34360Mbit 189696pps backlog 0b 0p requeues 0
      
      This patch carefully reorganizes "struct Qdisc" layout to get optimal
      performance on SMP.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Jun, 2013 1 commit
  4. 28 Feb, 2013 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for-each-entry iterators were
      conceived differently from the list ones, which take the form:

              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
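
      To illustrate, a hedged before/after sketch of a typical caller (head
      and handle_socket are hypothetical names):

              struct sock *sk;
              struct hlist_node *node;        /* old: extra cursor variable */

              /* before this patch */
              hlist_for_each_entry(sk, node, head, sk_node)
                      handle_socket(sk);

              /* after this patch: same shape as list_for_each_entry() */
              hlist_for_each_entry(sk, head, sk_node)
                      handle_socket(sk);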
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
         were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
         properly, so those had to be fixed up manually.
      
      The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 13 Feb, 2013 1 commit
  6. 15 Jan, 2013 1 commit
  7. 12 Dec, 2012 1 commit
    • pkt_sched: avoid requeues if possible · 1abbe139
      Eric Dumazet authored
      With BQL being deployed, the following behavior becomes more likely:
      
      We dequeue a packet from the qdisc in dequeue_skb(), then we realize
      the target tx queue is in XOFF state in sch_direct_xmit(), and we have
      to hold the skb in gso_skb for later.
      
      This shows up in the stats (tc -s qdisc dev eth0) as requeues.
      
      The problem with these requeues is that high-priority packets cannot
      be dequeued as long as this (possibly low-priority, big TSO) packet is
      not removed from gso_skb.
      
      At 1Gbps, a full-size TSO packet adds 500 us of extra latency.
      
      In some cases, we know that all packets dequeued from a qdisc are
      destined for a particular, known txq:

       - If the device is non-multiqueue
       - For all MQ/MQPRIO slave qdiscs
      
      This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark this
      capability, so that dequeue_skb() is allowed to dequeue a packet only
      if the associated txq is not stopped.
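
      A minimal sketch of the resulting check in dequeue_skb() (assuming the
      qdisc's single queue is cached in q->dev_queue):

              const struct netdev_queue *txq = q->dev_queue;

              /* dequeue only if our single, known txq can take the packet */
              if (!(q->flags & TCQ_F_ONETXQUEUE) ||
                  !netif_xmit_frozen_or_stopped(txq))
                      skb = q->dequeue(q);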
      
      This indeed reduces latencies for high-priority packets (and improves
      fairness with sfq/fq_codel), and almost eliminates qdisc 'requeues'.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 15 Aug, 2012 1 commit
  9. 21 Jul, 2012 1 commit
  10. 13 Jun, 2012 1 commit
    • bonding: Fix corrupted queue_mapping · 5ee31c68
      Eric Dumazet authored
      In the transmit path of the bonding driver, skb->cb is used to stash
      the skb->queue_mapping so that the bonding device can set its own
      queue mapping. This value becomes corrupted because skb->cb is also
      used in __dev_xmit_skb.
      
      When transmitting through the bonding driver, bond_select_queue is
      called from dev_queue_xmit. In bond_select_queue the original
      skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
      and skb->queue_mapping is overwritten with the bond driver queue.

      Subsequently in dev_queue_xmit, __dev_xmit_skb is called, which writes
      the packet length into skb->cb, thereby overwriting the stashed
      queue mapping. In bond_dev_queue_xmit (called from hard_start_xmit),
      the queue mapping for the skb is set to the stashed value, which is
      now the skb length and hence an invalid queue for the slave device.
      
      If we want to save skb->queue_mapping into skb->cb[], the best place
      is to add a field in struct qdisc_skb_cb, to make sure it won't
      conflict with other layers (e.g. Qdisc, Infiniband...).
      
      This patch also makes sure (struct qdisc_skb_cb)->data is aligned on
      8 bytes:
      
      The netem qdisc, for example, assumes it can store a u64 in it without
      a misalignment penalty.
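
      One plausible layout after this patch (the stashed-mapping field name
      is illustrative; _pad keeps data[] aligned on 8 bytes):

              struct qdisc_skb_cb {
                      unsigned int    pkt_len;
                      u16             slave_dev_queue_mapping;
                      u16             _pad;
                      unsigned char   data[20];
              };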
      
      Note: we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
      The largest user is CHOKe, and it fills them.
      
      Based on a previous patch from Tom Herbert.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Tom Herbert <therbert@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Roland Dreier <roland@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 10 Feb, 2012 1 commit
  12. 07 Feb, 2012 1 commit
  13. 01 Nov, 2011 1 commit
  14. 21 Oct, 2011 1 commit
  15. 06 Jul, 2011 1 commit
  16. 24 Mar, 2011 1 commit
    • net_sched: fix THROTTLED/RUNNING race · ef352e7c
      Eric Dumazet authored
      commit fd245a4a (net_sched: move TCQ_F_THROTTLED flag)
      added a race.
      
      qdisc_watchdog() runs from softirq, so special care must be taken, or
      we can lose a state transition (THROTTLED/RUNNING).
      
      Prior to fd245a4a, we were manipulating q->flags (qdisc->flags &=
      ~TCQ_F_THROTTLED;) and this manipulation could only race with
      qdisc_warn_nonwc().
      
      Since we want to avoid atomic ops in the qdisc fast path - that was
      the point of commit 37112105 (QDISC_STATE_RUNNING dont need atomic
      bit ops) - the fix is to move the THROTTLED bit into the 'state'
      field, which is manipulated with SMP- and IRQ-safe operations.
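
      A hedged sketch of the helpers after the fix (THROTTLED now lives in
      'state' and is flipped with atomic bitops; names follow the helpers
      introduced by fd245a4a):

      static inline void qdisc_throttled(struct Qdisc *qdisc)
      {
              set_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
      }

      static inline void qdisc_unthrottled(struct Qdisc *qdisc)
      {
              clear_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
      }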
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 04 Mar, 2011 1 commit
  18. 24 Feb, 2011 1 commit
  19. 21 Jan, 2011 3 commits
    • net_sched: accurate bytes/packets stats/rates · 9190b3b3
      Eric Dumazet authored
      In commit 44b82883 (net_sched: pfifo_head_drop problem), we fixed a
      problem with pfifo_head drops that incorrectly decreased
      sch->bstats.bytes and sch->bstats.packets.

      Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
      previously enqueued packet, and bstats cannot be corrected afterwards,
      so bstats/rates are not accurate (overestimated).
      
      This patch changes the qdisc_bstats updates to be done at dequeue()
      time instead of enqueue() time. bstats counters no longer account for
      dropped frames, and rates are more correct, since enqueue() bursts
      don't affect the dequeue() rate.
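
      A minimal sketch of the new pattern, assuming a qdisc_bstats_update()
      helper that bumps bstats.bytes/bstats.packets:

      static struct sk_buff *example_dequeue(struct Qdisc *sch)
      {
              struct sk_buff *skb = qdisc_dequeue_head(sch);

              if (skb)
                      qdisc_bstats_update(sch, skb);  /* count at dequeue */
              return skb;
      }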
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: RCU conversion of stab · a2da570d
      Eric Dumazet authored
      This patch converts stab qdisc management to RCU, so that we can
      perform the qdisc_calculate_pkt_len() call before taking the qdisc
      lock.

      This shortens how long the lock is held in __dev_xmit_skb().
      
      This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding a
      lot of cache misses and thus reducing latencies.
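
      A hedged sketch of the read side after the conversion (the inner
      helper name is an assumption):

      static inline void qdisc_calculate_pkt_len(struct sk_buff *skb,
                                                 struct Qdisc *sch)
      {
              struct qdisc_size_table *stab = rcu_dereference_bh(sch->stab);

              if (stab)
                      __qdisc_calc_pkt_len(skb, stab); /* no qdisc lock held */
      }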
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jesper Dangaard Brouer <hawk@diku.dk>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: move TCQ_F_THROTTLED flag · fd245a4a
      Eric Dumazet authored
      In commit 37112105 (net: QDISC_STATE_RUNNING dont need atomic bit
      ops) I moved the QDISC_STATE_RUNNING flag to the __state container,
      located in the cache line containing the qdisc lock and other
      often-dirtied fields.
      
      I now move the TCQ_F_THROTTLED bit too, so that the first cache line
      stays read-mostly and shared by all cpus. This should speed up
      HTB/CBQ, for example.
      
      Not using test_bit()/__clear_bit()/__test_and_set_bit() allows us to
      use an "unsigned int" for the __state container, reducing Qdisc size
      by 8 bytes.

      Introduce helpers to hide the implementation details.
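
      A hedged sketch of those helpers (plain, non-atomic ops on __state;
      the enum name is an assumption - and note this very placement is what
      the fix in ef352e7c above revisits):

      static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
      {
              return (qdisc->__state & __QDISC___STATE_THROTTLED) ? true : false;
      }

      static inline void qdisc_throttled(struct Qdisc *qdisc)
      {
              qdisc->__state |= __QDISC___STATE_THROTTLED;
      }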
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jesper Dangaard Brouer <hawk@diku.dk>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 11 Jan, 2011 1 commit
  21. 21 Dec, 2010 1 commit
  22. 17 Dec, 2010 1 commit
  23. 21 Oct, 2010 1 commit
  24. 24 Sep, 2010 1 commit
  25. 03 Jul, 2010 2 commits
  26. 29 Jun, 2010 1 commit
  27. 02 Jun, 2010 3 commits
    • net: add additional lock to qdisc to increase throughput · 79640a4c
      Eric Dumazet authored
      When many cpus compete for sending frames on a given qdisc, the qdisc
      spinlock suffers from very high contention.
      
      The cpu owning the __QDISC_STATE_RUNNING bit has the same priority as
      the others when acquiring the lock, and cannot dequeue packets fast
      enough, since it must wait for this lock for each dequeued packet.
      
      One solution to this problem is to force all cpus to spin on a second
      lock before trying to take the main lock, when/if they see
      __QDISC_STATE_RUNNING already set.
      
      The owning cpu then competes with at most one other cpu for the main
      lock, allowing a higher dequeue rate.
      
      Based on a previous patch from Alexander Duyck. I added the heuristic
      to avoid the atomic op in the fast path, and put the new lock far away
      from the cache line used by the dequeue worker. The busylock is also
      released as late as possible.
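
      A condensed, hedged sketch of the scheme in __dev_xmit_skb() (the
      'contended' test is the heuristic that skips the extra atomic when
      the qdisc is idle):

      bool contended = qdisc_is_running(q);

      if (unlikely(contended))
              spin_lock(&q->busylock);        /* at most one waiter spins
                                               * on root_lock */
      spin_lock(root_lock);
      /* ... enqueue the skb and/or run the qdisc ... */
      spin_unlock(root_lock);
      if (unlikely(contended))
              spin_unlock(&q->busylock);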
      
      Tests with the following script gave a boost from ~50,000 pps to
      ~600,000 pps on a dual quad-core machine (E5450 @ 3.00GHz) with the
      tg3 driver. (A single netperf flow can reach ~800,000 pps on this
      platform.)
      
      for j in `seq 0 3`; do
        for i in `seq 0 7`; do
          netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
        done
      done
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: QDISC_STATE_RUNNING dont need atomic bit ops · 37112105
      Eric Dumazet authored
      __QDISC_STATE_RUNNING is always changed while the qdisc lock is held.
      
      We can avoid two atomic operations in the xmit path if we move this
      bit into a new __state container.
      
      The location of this __state container is carefully chosen so that the
      fast path only dirties one qdisc cache line.
      
      The THROTTLED bit could later be moved into this __state location too,
      to avoid dirtying the first qdisc cache line.
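
      A hedged sketch of the resulting accessors (plain stores instead of
      atomic bitops, safe only because the qdisc lock serializes them):

      static inline bool qdisc_is_running(struct Qdisc *qdisc)
      {
              return (qdisc->__state & __QDISC___STATE_RUNNING) ? true : false;
      }

      static inline bool qdisc_run_begin(struct Qdisc *qdisc)
      {
              if (qdisc_is_running(qdisc))
                      return false;
              qdisc->__state |= __QDISC___STATE_RUNNING;
              return true;
      }

      static inline void qdisc_run_end(struct Qdisc *qdisc)
      {
              qdisc->__state &= ~__QDISC___STATE_RUNNING;
      }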
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Define accessors to manipulate QDISC_STATE_RUNNING · bc135b23
      Eric Dumazet authored
      Define three helpers to manipulate the QDISC_STATE_RUNNING flag, which
      a second patch will move to another location.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 02 Apr, 2010 1 commit
    • gen_estimator: deadlock fix · 5d944c64
      Eric Dumazet authored
      One of my test machines got a deadlock during "tc" sessions,
      adding/deleting classes & filters, using traffic estimators.
      
      After some analysis, I believe we have a potential use-after-free case
      in est_timer():
      
      spin_lock(e->stats_lock); << HERE >>
      read_lock(&est_lock);
      if (e->bstats == NULL)   << TEST >>
      	goto skip;
      
      The test is done a bit late: after the estimator is killed, and before
      the RCU grace period has elapsed, we might already have freed/reused
      the memory that e->stats_lock points to (some qdisc->q.lock).
      
      A possible fix is to respect an RCU grace period at Qdisc dismantle
      time.
      
      On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16 bytes to
      it (for struct rcu_head) is a problem because it might change
      performance, given QDISC_ALIGNTO is 32 bytes.
      
      This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy most
      current alignment requirements.
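
      A hedged sketch of the dismantle path after this change (assuming the
      callback is named qdisc_rcu_free and qdisc->padded records the
      alignment padding in front of the struct):

      static void qdisc_rcu_free(struct rcu_head *head)
      {
              struct Qdisc *qdisc = container_of(head, struct Qdisc, rcu_head);

              kfree((char *)qdisc - qdisc->padded);
      }

      /* in qdisc_destroy(): defer the free for one RCU grace period */
      call_rcu(&qdisc->rcu_head, qdisc_rcu_free);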
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 29 Jan, 2010 1 commit
  30. 04 Nov, 2009 1 commit
  31. 15 Sep, 2009 1 commit
    • pkt_sched: Fix tx queue selection in tc_modify_qdisc · 926e61b7
      Jarek Poplawski authored
      After the recent mq change, the new select_queue qdisc class method is
      used in tc_modify_qdisc, but it works correctly only for direct child
      qdiscs of the mq qdisc. Grandchildren always get the first tx queue,
      which gives wrong qdisc_root etc. results (e.g. for sch_htb as a child
      of sch_prio). This patch fixes it by using the parent's dev_queue for
      such grandchild qdiscs. The select_queue method's return type is
      changed as well.
      
      With feedback from: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 10 Sep, 2009 1 commit
    • net_sched: fix estimator lock selection for mq child qdiscs · 23bcf634
      Patrick McHardy authored
      When new child qdiscs are attached to the mq qdisc, they are actually
      attached as root qdiscs to the device queues. The lock selection for
      new estimators incorrectly picks the root lock of the existing,
      to-be-replaced qdisc, which results in a use-after-free once the old
      qdisc has been destroyed.
      
      Mark mq qdisc instances with a new flag and treat qdiscs attached to
      mq as children, similar to regular root qdiscs.

      Additionally, prevent estimators from being attached to the mq qdisc
      itself, since it only updates its byte and packet counters during
      dumps.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 06 Sep, 2009 2 commits
  34. 18 Aug, 2009 1 commit