1. 14 Sep, 2014 1 commit
  2. 19 Jul, 2013 1 commit
    • pkt_sched: sch_qfq: remove a source of high packet delay/jitter · 87f40dd6
      Committed by Paolo Valente
      QFQ+ inherits from QFQ a design choice that may cause a high packet
      delay/jitter and a severe short-term unfairness. As QFQ, QFQ+ uses a
      special quantity, the system virtual time, to track the service
      provided by the ideal system it approximates. When a packet is
      dequeued, this quantity must be incremented by the size of the packet,
      divided by the sum of the weights of the aggregates waiting to be
      served. Tracking this sum correctly is a non-trivial task, because, to
      preserve tight service guarantees, the decrement of this sum must be
      delayed in a special way [1]: this sum can be decremented only once
      its value would also have decreased in the ideal system approximated by
      QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
      weight sum, increased and decreased immediately as the weight of an
      aggregate changes, and as an aggregate is created or destroyed (which,
      in its turn, happens as a consequence of some class being
      created/destroyed/changed). However, to avoid the problems caused to
      service guarantees by these immediate decreases, QFQ+ increments the
      system virtual time using the maximum value allowed for the weight
      sum, 2^10, in place of the dynamic, instantaneous value. The
      instantaneous value of the weight sum is used only to check whether a
      request of weight increase or a class creation can be satisfied.
      
      Unfortunately, the problems caused by this choice are worse than the
      temporary degradation of the service guarantees that may occur, when a
      class is changed or destroyed, if the instantaneous value of the
      weight sum were used to update the system virtual time. In fact, the
      fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
      equal to the ratio between the weight of the aggregate and the sum of
      the weights of the competing aggregates. The packet delay guaranteed
      to the aggregate is instead inversely proportional to the guaranteed
      bandwidth. By using the maximum possible value, and not the actual
      value of the weight sum, QFQ+ provides each aggregate with the worst
      possible service guarantees, and not with service guarantees related
      to the actual set of competing aggregates. To see the consequences of
      this fact, consider the following simple example.
      
      Suppose that only the following aggregates are backlogged, i.e., that
      only the classes in the following aggregates have packets to transmit:
      one aggregate with weight 10, say A, and ten aggregates with weight 1,
      say B1, B2, ..., B10. In particular, suppose that these aggregates are
      always backlogged. Given the weight distribution, the smoothest and
      fairest service order would be:
      A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...
      
      QFQ+ would provide exactly this optimal service if it used the actual
      value for the weight sum instead of the maximum possible value, i.e.,
      11 instead of 2^10. In contrast, since QFQ+ uses the latter value, it
      serves aggregates as follows (easy to prove and to reproduce
      experimentally):
      A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...
      
      By replacing 10 with N in the above example, and by increasing N, one
      can increase at will the maximum packet delay and the jitter
      experienced by the classes in aggregate A.
      
      This patch addresses this issue by just using the above
      'instantaneous' value of the weight sum, instead of the maximum
      possible value, when updating the system virtual time.  After the
      instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
      service for a time interval on the order of the time to serve one
      maximum-size packet for each backlogged class. The worst-case extent
      of the deviation exhibited by QFQ+ during this time interval [1] is
      basically the same as that of the deviation described above (but, without
      this patch, QFQ+ suffers from such a deviation all the time). Finally,
      this patch modifies the comment to the function qfq_slot_insert, to
      make it coherent with the fact that the weight sum used by QFQ+ can
      now be lower than the maximum possible value.
      
      [1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
      Proceedings of AAA-IDEA'05, June 2005.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
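The effect of the weight sum on the virtual-time update described in this commit can be sketched in a few lines of C. This is only an illustration under assumptions: the fixed-point scale `ONE_FP_SKETCH` and the helper `vtime_step()` are hypothetical names chosen for this sketch, not identifiers quoted from sch_qfq.c. It shows that, for the same packet, dividing by the maximum weight sum (2^10) advances the virtual time far less than dividing by the actual sum (11 in the example above).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale for this sketch (an assumption). */
#define ONE_FP_SKETCH (1ULL << 30)

/*
 * Virtual-time increment for a dequeued packet of pkt_len bytes:
 * pkt_len / wsum, kept in fixed point by multiplying the length by a
 * precomputed inverse of the weight sum.
 */
static uint64_t vtime_step(uint32_t pkt_len, uint32_t wsum)
{
	uint64_t inv_wsum;

	assert(wsum > 0);	/* at least one aggregate must be backlogged */
	inv_wsum = ONE_FP_SKETCH / wsum;
	return (uint64_t)pkt_len * inv_wsum;
}
```

For a 1500-byte packet, `vtime_step(1500, 11)` is roughly 93 times larger than `vtime_step(1500, 1024)`, which is the ratio 2^10 / 11 between the two weight sums.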
  3. 12 Jul, 2013 2 commits
  4. 11 Jun, 2013 1 commit
    • net_sched: add 64bit rate estimators · 45203a3b
      Committed by Eric Dumazet
      struct gnet_stats_rate_est contains u32 fields, so the bytes per second
      field can wrap at 34360Mbit.
      
      Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
      and switch the kernel to use this structure natively.
      
      This structure is dumped to user space as a new attribute:
      
      TCA_STATS_RATE_EST64
      
      An old tc command will now display the capped bps (34360Mbit) instead
      of wrapped values, and an updated tc command will display the correct
      information.
      
      Old tc command output, after the patch:
      
      eric:~# tc -s -d qd sh dev lo
      qdisc pfifo 8001: root refcnt 2 limit 1000p
       Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
       rate 34360Mbit 189696pps backlog 0b 0p requeues 0
      
      This patch carefully reorganizes "struct Qdisc" layout to get optimal
      performance on SMP.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
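The 34360Mbit figure in this commit follows from the width of a u32 bytes-per-second field: 4294967295 bytes/s times 8 is about 34360 Mbit/s. A minimal sketch of the capping behaviour, assuming a simple saturating conversion (the helper names `cap_bps_to_u32()` and `bytes_per_sec_to_mbit()` are hypothetical and are not kernel or tc functions):

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 64-bit bytes-per-second rate into a 32-bit field. */
static uint32_t cap_bps_to_u32(uint64_t bytes_per_sec)
{
	return bytes_per_sec > UINT32_MAX ? UINT32_MAX : (uint32_t)bytes_per_sec;
}

/* Convert bytes per second to Mbit/s, rounded to the nearest Mbit. */
static uint64_t bytes_per_sec_to_mbit(uint64_t bytes_per_sec)
{
	assert(bytes_per_sec <= UINT64_MAX / 8);	/* keep the * 8 from overflowing */
	return (bytes_per_sec * 8 + 500000) / 1000000;
}
```

A rate of 5000000000 bytes/s does not fit in 32 bits, so the capped value is UINT32_MAX, which converts to the 34360 Mbit shown in the old tc output above.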
  5. 06 Mar, 2013 6 commits
    • pkt_sched: sch_qfq: remove a useless invocation of qfq_update_eligible · 76e4cb0d
      Committed by Paolo Valente
      QFQ+ can select for service only 'eligible' aggregates, i.e.,
      aggregates that would have started to be served also in the emulated
      ideal system.  As a consequence, for QFQ+ to be work conserving, at
      least one of the active aggregates must be eligible when it is time to
      choose the next aggregate to serve.
      
      The set of eligible aggregates is updated through the function
      qfq_update_eligible(), which does guarantee that, after its
      invocation, at least one of the active aggregates is eligible.
      Because of this property, this function is invoked in
      qfq_deactivate_agg() to guarantee that at least one of the active
      aggregates is still eligible after an aggregate has been deactivated.
      In particular, the critical case is when there are other active
      aggregates, but the aggregate being deactivated happens to be the only
      one eligible.
      
      However, this precaution is not needed for QFQ+ to be work conserving,
      because qfq_update_eligible() is always invoked also at the beginning of
      qfq_choose_next_agg(). This patch removes the additional invocation of
      qfq_update_eligible() in qfq_deactivate_agg().
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: sch_qfq: do not allow virtual time to jump if an aggregate is in service · 40dd2d54
      Committed by Paolo Valente
      By definition of (the algorithm of) QFQ+, the system virtual time must
      be pushed up only if there is no 'eligible' aggregate, i.e. no
      aggregate that would have started to be served also in the ideal
      system emulated by QFQ+.  QFQ+ serves only eligible aggregates, hence
      the aggregate currently in service is eligible.  As a consequence, to
      decide whether there is no eligible aggregate, QFQ+ must also check
      whether there is no aggregate in service.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: sch_qfq: prevent budget from wrapping around after a dequeue · a0143efa
      Committed by Paolo Valente
      Aggregate budgets are computed so as to guarantee that, after an
      aggregate has been selected for service, that aggregate has enough
      budget to serve at least one maximum-size packet for the classes it
      contains. For this reason, after a new aggregate has been selected
      for service, its next packet is immediately dequeued, without any
      further control.
      
      The maximum packet size for a class, lmax, can be changed through
      qfq_change_class(). In case the user sets lmax to a lower value than
      the size of some of the still-to-arrive packets, QFQ+ will
      automatically push up lmax as it enqueues these packets.  This
      automatic push-up is likely to happen with TSO/GSO.
      
      In any case, if lmax is assigned a lower value than the size of some
      of the packets already enqueued for the class, then the following
      problem may occur: the size of the next packet to dequeue for the
      class may happen to be larger than lmax, after the aggregate to which
      the class belongs has been just selected for service. In this case,
      even the budget of the aggregate, which is an unsigned value, may be
      lower than the size of the next packet to dequeue. After dequeueing
      this packet and subtracting its size from the budget, the latter would
      wrap around.
      
      This fix prevents the budget from wrapping around after any packet
      dequeue.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
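The wrap-around described in this commit is ordinary unsigned arithmetic: subtracting a packet size larger than the remaining budget yields a huge value instead of a negative one. The following sketch contrasts an unchecked subtraction with a saturating one. Both helper names are hypothetical, and clamping the budget to zero is an assumption about how such a fix can be written, not a quotation of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Unchecked update: wraps around when pkt_len > budget. */
static uint32_t charge_budget_unchecked(uint32_t budget, uint32_t pkt_len)
{
	return budget - pkt_len;
}

/* Saturating update: the budget never drops below zero. */
static uint32_t charge_budget(uint32_t budget, uint32_t pkt_len)
{
	if (budget < pkt_len)
		return 0;
	return budget - pkt_len;
}
```

With a budget of 1000 and a 1500-byte packet, the unchecked version returns 4294966796 (2^32 - 500), while the saturating version returns 0.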
    • pkt_sched: sch_qfq: serve activated aggregates immediately if the scheduler is empty · 2f3b89a1
      Committed by Paolo Valente
      If no aggregate is in service, then the function qfq_dequeue() does
      not dequeue any packet. For this reason, to guarantee QFQ+ to be work
      conserving, a just-activated aggregate must be set as in service
      immediately if it happens to be the only active aggregate.
      This is done by the function qfq_enqueue().
      
      Unfortunately, the function qfq_add_to_agg(), used to add a class to
      an aggregate, does not perform this important additional operation.
      In particular, if: 1) qfq_add_to_agg() is invoked to complete the move
      of a class from a source aggregate, which becomes inactive as a result
      of the move, to a destination aggregate, which becomes active instead,
      and 2) the destination aggregate becomes the only active aggregate,
      then this aggregate is nevertheless not set as in service. QFQ+ then
      remains in a non-work-conserving state until a new invocation of
      qfq_enqueue() recovers the situation.
      
      This fix solves the problem by moving the logic for setting an
      aggregate as in service directly into the function qfq_activate_agg().
      Hence, from whatever point qfq_activate_agg() is invoked, QFQ+
      remains work conserving.  Since the more complex logic of this new
      version of qfq_activate_agg() is not necessary in qfq_dequeue() to
      reschedule an aggregate that exhausts its budget, the aggregate
      is now rescheduled there by directly invoking the functions needed.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: sch_qfq: fix the update of eligible-group sets · 624b85fb
      Committed by Paolo Valente
      Between two invocations of make_eligible, the system virtual time may
      happen to grow enough that, in its binary representation, a bit with
      higher order than 31 flips. This happens especially with
      TSO/GSO. Before this fix, the mask used in make_eligible was computed
      as (1UL<<index_of_last_flipped_bit)-1, whose value is well defined on
      a 64-bit architecture, because index_of_last_flipped_bit <= 63, but is in
      general undefined on a 32-bit architecture if index_of_last_flipped_bit > 31.
      The fix just replaces 1UL with 1ULL.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
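The mask computation discussed in this commit can be illustrated in isolation. On a target where unsigned long is 32 bits wide, `1UL << 35` is undefined, whereas `1ULL << 35` is well defined because unsigned long long is at least 64 bits. A minimal sketch (the helper name `mask_below_bit()` is hypothetical, not a function in sch_qfq.c):

```c
#include <assert.h>
#include <stdint.h>

/* Mask with all bits below position 'bit' set; valid for bit in [0, 63]. */
static uint64_t mask_below_bit(unsigned int bit)
{
	assert(bit <= 63);	/* shifting a 64-bit value by 64 or more is undefined */
	return (1ULL << bit) - 1;
}
```

For instance, `mask_below_bit(32)` yields 0xFFFFFFFF regardless of the width of unsigned long on the target.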
    • pkt_sched: sch_qfq: properly cap timestamps in charge_actual_service · 9b99b7e9
      Committed by Paolo Valente
      QFQ+ schedules the active aggregates in a group using a bucket list
      (one list per group). The bucket in which each aggregate is inserted
      depends on the aggregate's timestamps, and the number
      of buckets in a group is enough to accommodate the possible (range of)
      values of the timestamps of all the aggregates in the group. For this
      property to hold, timestamps must however be computed correctly.  One
      necessary condition for computing timestamps correctly is that the
      number of bits dequeued for each aggregate, while the aggregate is in
      service, does not exceed the maximum budget budgetmax assigned to the
      aggregate.
      
      For each aggregate, budgetmax is proportional to the number of classes
      in the aggregate. If the number of classes of the aggregate is
      decreased through qfq_change_class(), then budgetmax is decreased
      automatically as well.  Problems may occur if the aggregate is in
      service when budgetmax is decreased, because the current remaining
      budget of the aggregate and/or the service already received by the
      aggregate may happen to be larger than the new value of budgetmax.  In
      this case, when the aggregate is eventually deselected and its
      timestamps are updated, the aggregate may happen to have received an
      amount of service larger than budgetmax.  This may cause the aggregate
      to be assigned a higher virtual finish time than the maximum
      acceptable value for the last bucket in the bucket list of the group.
      
      This fix introduces a cap that addresses this issue.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
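One way to picture the cap introduced by this commit is to bound the service charged to an aggregate by budgetmax before computing its finish time. The sketch below is written under assumptions: the structure `agg_sketch`, its field names, and the formula F = S + service * inv_w are illustrative stand-ins and are not quoted from sch_qfq.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of an aggregate for this sketch. */
struct agg_sketch {
	uint64_t S;		/* virtual start time */
	uint64_t F;		/* virtual finish time */
	uint32_t inv_w;		/* inverse weight, fixed point */
	uint32_t budgetmax;	/* current maximum budget */
	uint32_t initial_budget;/* budget when service started */
	uint32_t budget;	/* budget still unused */
};

/* Charge the service actually received, but never more than budgetmax. */
static void charge_service_capped(struct agg_sketch *agg)
{
	uint32_t used;

	assert(agg->budget <= agg->initial_budget);
	used = agg->initial_budget - agg->budget;
	if (used > agg->budgetmax)
		used = agg->budgetmax;	/* budgetmax may have shrunk meanwhile */
	agg->F = agg->S + (uint64_t)used * agg->inv_w;
}
```

If budgetmax was lowered from 6000 to 3000 while the aggregate had already consumed 5000, the finish time is computed from 3000 rather than 5000, which keeps it within the range that the bucket list can represent.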
  6. 28 Feb, 2013 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for-each-entry iterators were conceived
      differently from the list ones, which take three parameters:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
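To see why the extra node cursor is unnecessary, here is a self-contained userspace sketch of a three-argument iterator in the style this commit moves to. It is not the kernel's list.h: the types `hnode` and `hhead`, the macro `for_each_item_entry`, and the helpers are hypothetical, and the macro relies on the `__typeof__` extension available in GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>

struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

struct item {
	int value;
	struct hnode link;	/* embedded list node */
};

/* Map a node pointer back to its containing entry, or NULL at list end. */
static void *entry_or_null(struct hnode *n, size_t offset)
{
	return n ? (void *)((char *)n - offset) : NULL;
}

/* Iterate with the entry pointer only; no separate node cursor is needed. */
#define for_each_item_entry(pos, head, member)				\
	for (pos = entry_or_null((head)->first,				\
				 offsetof(__typeof__(*(pos)), member));	\
	     pos;							\
	     pos = entry_or_null((pos)->member.next,			\
				 offsetof(__typeof__(*(pos)), member)))

static void push_front(struct hhead *h, struct hnode *n)
{
	assert(h && n);
	n->next = h->first;
	h->first = n;
}

static int sum_values(struct hhead *h)
{
	struct item *it;
	int sum = 0;

	for_each_item_entry(it, h, link)
		sum += it->value;
	return sum;
}
```

The next node is always reachable through the entry itself (`pos->member.next`), so the caller only has to declare the entry pointer, exactly as with list_for_each_entry.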
  7. 29 Nov, 2012 1 commit
  8. 08 Nov, 2012 1 commit
  9. 28 Sep, 2012 1 commit
    • pkt_sched: Fix warning false positives. · f54ba779
      Committed by David S. Miller
      GCC refuses to recognize that all error control flows do in fact
      set err to something.
      
      Add an explicit initialization to shut it up.
      
      net/sched/sch_drr.c: In function ‘drr_enqueue’:
      net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
      net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 20 Sep, 2012 1 commit
    • pkt_sched: fix virtual-start-time update in QFQ · 71261956
      Committed by Paolo Valente
      If the old timestamps of a class, say cl, are stale when the class
      becomes active, then QFQ may assign to cl a much higher start time
      than the maximum value allowed. This may happen when QFQ assigns to
      the start time of cl the finish time of a group whose classes are
      characterized by a higher value of the ratio
      max_class_pkt/weight_of_the_class with respect to that of
      cl. Inserting a class with a too high start time into the bucket list
      corrupts the data structure and may eventually lead to crashes.
      This patch limits the maximum start time assigned to a class.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 09 Aug, 2012 1 commit
    • sched: add missing group change to qfq_change_class · be72f63b
      Committed by Paolo Valente
      [Resending again, as the text was corrupted by the email client]
      
      To speed up operations, QFQ internally divides classes into
      groups. Which group a class belongs to depends on the ratio between
      the maximum packet length and the weight of the class. Unfortunately
      the function qfq_change_class lacks the steps for changing the group
      of a class when the ratio max_pkt_len/weight of the class changes.
      
      For example, when the last of the following three commands is
      executed, the group of class 1:1 is not correctly changed:
      
      tc qdisc add dev XXX root handle 1: qfq
      tc class add dev XXX parent 1: qfq classid 1:1 weight 1
      tc class change dev XXX parent 1: classid 1:1 qfq weight 4
      
      Not changing the group of a class does not affect the long-term
      bandwidth guaranteed to the class, as the latter is independent of the
      maximum packet length, and correctly changes (only) if the weight of
      the class changes. In contrast, if the group of the class is not
      updated, the class is still guaranteed the short-term bandwidth and
      packet delay related to its old group, instead of the guarantees that
      it should receive according to its new weight and/or maximum packet
      length. This may also break service guarantees for other classes.
      This patch adds the missing operations.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
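The dependence of the group on the ratio max_pkt_len/weight can be made concrete with a simplified model. The sketch below assumes, following the published QFQ design rather than the kernel source, that a class falls into the group whose index is the base-2 logarithm of that ratio, rounded up. The helper `group_index_sketch()` is hypothetical, and the kernel's real index computation uses fixed-point quantities and differs in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest i such that weight * 2^i >= max_pkt_len, i.e. the rounded-up
 * base-2 logarithm of max_pkt_len / weight (simplified model).
 */
static unsigned int group_index_sketch(uint32_t max_pkt_len, uint32_t weight)
{
	unsigned int i = 0;

	assert(weight > 0);
	while (((uint64_t)weight << i) < max_pkt_len)
		i++;
	return i;
}
```

Under this model, raising the weight of a class from 1 to 4 with a maximum packet length of 2048 bytes moves it from group 11 to group 9, which is the kind of group change that qfq_change_class was failing to perform.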
  12. 02 Apr, 2012 1 commit
  13. 04 Jan, 2012 2 commits
  14. 05 Apr, 2011 1 commit