1. 30 Mar 2013, 1 commit
  2. 28 Mar 2013, 1 commit
    • sch: add missing u64 in psched_ratecfg_precompute() · ea872d77
      Committed by Sergey Popovich
      It seems that commit
      
      commit 292f1c7f
      Author: Jiri Pirko <jiri@resnulli.us>
      Date:   Tue Feb 12 00:12:03 2013 +0000
      
          sch: make htb_rate_cfg and functions around that generic
      
introduced a small regression.
      
      Before:
      
      # tc qdisc add dev eth0 root handle 1: htb default ffff
      # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
      # tc -s class show dev eth0
      class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst
      625b
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       rate 0bit 0pps backlog 0b 0p requeues 0
       lended: 0 borrowed: 0 giants: 0
       tokens: 31 ctokens: 31
      
      After:
      
      # tc qdisc add dev eth0 root handle 1: htb default ffff
      # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
      # tc -s class show dev eth0
      class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst
      625b
       Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0)
       rate 1976bit 2pps backlog 0b 0p requeues 0
       lended: 41 borrowed: 0 giants: 0
       tokens: 1802 ctokens: 1802
      
This is probably due to a lost u64 cast of the rate parameter in
psched_ratecfg_precompute() (net/sched/sch_generic.c).
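
A minimal userspace sketch of the overflow (illustrative only, not the
kernel source): with the rate held in a 32-bit variable, converting
bytes/s to bits/s wraps for rates above ~4 Gbit/s unless the operand is
widened to u64 first.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t rate = 625000000;              /* 5 Gbit/s in bytes/s */
                uint64_t wrapped = rate << 3;           /* 32-bit shift wraps to ~705 Mbit/s */
                uint64_t correct = (uint64_t)rate << 3; /* widen first: 5 Gbit/s */

                printf("without cast: %llu bit/s\n", (unsigned long long)wrapped);
                printf("with cast:    %llu bit/s\n", (unsigned long long)correct);
                return 0;
        }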
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      ea872d77
3. 06 Mar 2013, 6 commits
    • pkt_sched: sch_qfq: remove a useless invocation of qfq_update_eligible · 76e4cb0d
      Committed by Paolo Valente
QFQ+ can select for service only 'eligible' aggregates, i.e.,
aggregates that would also have started being served in the emulated
ideal system.  As a consequence, for QFQ+ to be work conserving, at
least one of the active aggregates must be eligible when it is time to
choose the next aggregate to serve.
      
      The set of eligible aggregates is updated through the function
      qfq_update_eligible(), which does guarantee that, after its
      invocation, at least one of the active aggregates is eligible.
      Because of this property, this function is invoked in
      qfq_deactivate_agg() to guarantee that at least one of the active
      aggregates is still eligible after an aggregate has been deactivated.
      In particular, the critical case is when there are other active
      aggregates, but the aggregate being deactivated happens to be the only
      one eligible.
      
However, this precaution is not needed for QFQ+ to be work conserving,
because qfq_update_eligible() is also always invoked at the beginning
of qfq_choose_next_agg(). This patch removes the additional invocation
of qfq_update_eligible() in qfq_deactivate_agg().
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      76e4cb0d
    • pkt_sched: sch_qfq: do not allow virtual time to jump if an aggregate is in service · 40dd2d54
      Committed by Paolo Valente
By definition of (the algorithm of) QFQ+, the system virtual time must
be pushed up only if there is no 'eligible' aggregate, i.e., no
aggregate that would also have started being served in the ideal
system emulated by QFQ+.  QFQ+ serves only eligible aggregates, hence
the aggregate currently in service is eligible.  As a consequence, to
decide whether there is no eligible aggregate, QFQ+ must also check
whether there is no aggregate in service.
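
A minimal model of that rule (names are illustrative, not the
kernel's): the virtual time may be pushed up only when there is neither
an eligible aggregate nor an aggregate in service.

        #include <stdbool.h>

        /* The in-service aggregate is eligible by construction, so it must
         * be counted before concluding that no eligible aggregate exists. */
        static bool may_push_up_vtime(bool have_eligible_agg, bool have_in_serv_agg)
        {
                return !have_eligible_agg && !have_in_serv_agg;
        }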
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      40dd2d54
    • pkt_sched: sch_qfq: prevent budget from wrapping around after a dequeue · a0143efa
      Committed by Paolo Valente
      Aggregate budgets are computed so as to guarantee that, after an
      aggregate has been selected for service, that aggregate has enough
      budget to serve at least one maximum-size packet for the classes it
      contains. For this reason, after a new aggregate has been selected
      for service, its next packet is immediately dequeued, without any
      further control.
      
The maximum packet size for a class, lmax, can be changed through
qfq_change_class(). If the user sets lmax to a value lower than
the size of some of the still-to-arrive packets, QFQ+ will
automatically push up lmax as it enqueues these packets.  This
automatic push-up is likely to happen with TSO/GSO.
      
In any case, if lmax is assigned a value lower than the size of some
of the packets already enqueued for the class, then the following
problem may occur: the size of the next packet to dequeue for the
class may happen to be larger than lmax, right after the aggregate to
which the class belongs has been selected for service. In this case,
even the budget of the aggregate, which is an unsigned value, may be
lower than the size of the next packet to dequeue. After dequeueing
this packet and subtracting its size from the budget, the latter would
wrap around.
      
      This fix prevents the budget from wrapping around after any packet
      dequeue.
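
A userspace sketch of the wrap, together with the kind of clamped
subtraction the fix applies (the shape of the clamp is an assumption;
variable names are illustrative):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t budget = 1000;   /* remaining aggregate budget, bytes */
                uint32_t len    = 1500;   /* next packet, larger than the budget */

                uint32_t wrapped = budget - len;                     /* wraps to 4294966796 */
                uint32_t clamped = budget >= len ? budget - len : 0; /* stays at 0 */

                printf("wrapped: %u\nclamped: %u\n", wrapped, clamped);
                return 0;
        }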
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      a0143efa
    • pkt_sched: sch_qfq: serve activated aggregates immediately if the scheduler is empty · 2f3b89a1
      Committed by Paolo Valente
      If no aggregate is in service, then the function qfq_dequeue() does
      not dequeue any packet. For this reason, to guarantee QFQ+ to be work
      conserving, a just-activated aggregate must be set as in service
      immediately if it happens to be the only active aggregate.
      This is done by the function qfq_enqueue().
      
Unfortunately, the function qfq_add_to_agg(), used to add a class to
an aggregate, does not perform this important additional operation.
In particular, if: 1) qfq_add_to_agg() is invoked to complete the move
of a class from a source aggregate (which becomes inactive as a result)
to a destination aggregate (which instead becomes active), and 2) the
destination aggregate becomes the only active aggregate, then that
aggregate is nevertheless not set as in service. QFQ+ then remains in a
non-work-conserving state until a new invocation of qfq_enqueue()
recovers the situation.
      
This fix solves the problem by moving the logic for setting an
aggregate as in service directly into the function qfq_activate_agg().
Hence, from whatever point qfq_activate_agg() is invoked, QFQ+
remains work conserving.  Since the more complex logic of this new
version of qfq_activate_agg() is not needed in qfq_dequeue() to
reschedule an aggregate that finishes its budget, that aggregate is
now rescheduled by invoking the needed functions directly.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      2f3b89a1
    • pkt_sched: sch_qfq: fix the update of eligible-group sets · 624b85fb
      Committed by Paolo Valente
Between two invocations of make_eligible, the system virtual time may
happen to grow enough that, in its binary representation, a bit of
order higher than 31 flips. This happens especially with
TSO/GSO. Before this fix, the mask used in make_eligible was computed
as (1UL<<index_of_last_flipped_bit)-1, whose value is well defined on
a 64-bit architecture, because index_of_last_flipped_bit <= 63, but is
in general undefined on a 32-bit architecture if
index_of_last_flipped_bit > 31. The fix simply replaces 1UL with 1ULL.
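
A short illustration of the fix: where unsigned long is 32 bits,
shifting 1UL by more than 31 is undefined, while 1ULL keeps the mask
well defined for any bit index up to 63.

        #include <stdint.h>

        static uint64_t eligible_mask(unsigned int index_of_last_flipped_bit)
        {
                /* Before the fix (undefined on 32-bit when the index exceeds 31):
                 *      return (1UL << index_of_last_flipped_bit) - 1;
                 */
                return (1ULL << index_of_last_flipped_bit) - 1;
        }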
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      624b85fb
    • pkt_sched: sch_qfq: properly cap timestamps in charge_actual_service · 9b99b7e9
      Committed by Paolo Valente
QFQ+ schedules the active aggregates in a group using a bucket list
(one list per group). The bucket in which each aggregate is inserted
depends on the aggregate's timestamps, and the number
of buckets in a group is enough to accommodate the possible range of
values of the timestamps of all the aggregates in the group. For this
property to hold, however, timestamps must be computed correctly.  One
necessary condition for computing timestamps correctly is that the
number of bits dequeued for each aggregate, while the aggregate is in
service, does not exceed the maximum budget, budgetmax, assigned to the
aggregate.
      
      For each aggregate, budgetmax is proportional to the number of classes
      in the aggregate. If the number of classes of the aggregate is
      decreased through qfq_change_class(), then budgetmax is decreased
      automatically as well.  Problems may occur if the aggregate is in
      service when budgetmax is decreased, because the current remaining
      budget of the aggregate and/or the service already received by the
      aggregate may happen to be larger than the new value of budgetmax.  In
      this case, when the aggregate is eventually deselected and its
      timestamps are updated, the aggregate may happen to have received an
      amount of service larger than budgetmax.  This may cause the aggregate
      to be assigned a higher virtual finish time than the maximum
      acceptable value for the last bucket in the bucket list of the group.
      
      This fix introduces a cap that addresses this issue.
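
A sketch of the kind of cap described; the function and field names are
assumptions for illustration, not the patch itself:

        #include <stdint.h>

        /* Clamp the service charged to an aggregate so that the resulting
         * virtual finish time cannot land past the last bucket of its group. */
        static uint64_t capped_service(uint64_t service_received, uint64_t budgetmax)
        {
                return service_received > budgetmax ? budgetmax : service_received;
        }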
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      9b99b7e9
4. 28 Feb 2013, 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
I'm not sure why, but while the list for-each-entry iterators were
conceived as

        list_for_each_entry(pos, head, member)

the hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)
      
Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
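
Before/after usage, per the description above (the containing struct is
illustrative):

        struct my_item {
                int value;
                struct hlist_node node;
        };

        /* Before: an extra struct hlist_node * cursor had to be declared. */
        struct hlist_node *pos;
        struct my_item *item;
        hlist_for_each_entry(item, pos, head, node) { /* use item */ }

        /* After: same shape as list_for_each_entry(). */
        hlist_for_each_entry(item, head, node) { /* use item */ }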
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
 - A small number of places were using the 'node' parameter; these
   were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
5. 19 Feb 2013, 2 commits
  6. 16 Feb 2013, 1 commit
  7. 13 Feb 2013, 9 commits
  8. 30 Jan 2013, 1 commit
    • netem: fix delay calculation in rate extension · a13d3104
      Committed by Johannes Naab
The delay calculation with the rate extension introduced in v3.3 does
not work properly if other packets are still queued for transmission.
For the delay calculation to work, both delay types (latency, and the
delay introduced by rate limitation) have to be handled differently. The
latency delay for a packet can overlap with the delay of other packets.
The delay introduced by the rate, however, is separate, and can only
start once all other rate-introduced delays have finished.
      
The latency delay is drawn from the same distribution for each packet;
the rate delay depends on the packet size.
      
      .: latency delay
      -: rate delay
      x: additional delay we have to wait since another packet is currently
         transmitted
      
        .....----                    Packet 1
          .....xx------              Packet 2
                     .....------     Packet 3
          ^^^^^
          latency stacks
               ^^
               rate delay doesn't stack
                     ^^
                     latency stacks
      
        -----> time
      
When a packet is enqueued, we first consider the latency delay. If other
packets are already queued, we can reduce the latency delay until the
last packet in the queue is sent, but the latency delay cannot drop
below 0, since that would mean the rate is overcommitted.  The new
reference point is the time at which the last packet will be sent. To
find the time when the packet should be sent, the rate-introduced delay
has to be added on top of that.
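
A compact sketch of that rule (illustrative names; the kernel works in
psched ticks rather than raw nanoseconds):

        #include <stdint.h>

        /* Latency may overlap packets already queued, but the send time can
         * never precede the end of the last rate-limited transmission. */
        static uint64_t netem_send_time(uint64_t now, uint64_t latency_delay,
                                        uint64_t rate_delay, uint64_t last_tx_end)
        {
                uint64_t start = now + latency_delay;

                if (start < last_tx_end)
                        start = last_tx_end;    /* rate delays do not stack with latency */
                return start + rate_delay;
        }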
Signed-off-by: Johannes Naab <jn@stusta.de>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      a13d3104
9. 15 Jan 2013, 1 commit
  10. 22 Dec 2012, 1 commit
  11. 12 Dec 2012, 1 commit
    • pkt_sched: avoid requeues if possible · 1abbe139
      Committed by Eric Dumazet
With BQL deployed, the following behavior becomes more likely:
      
We dequeue a packet from the qdisc in dequeue_skb(), then realize the
target tx queue is in XOFF state in sch_direct_xmit(), and we have to
hold the skb in gso_skb for later.
      
      This shows in stats (tc -s qdisc dev eth0) as requeues.
      
The problem with these requeues is that high-priority packets cannot be
dequeued as long as this (possibly low-priority and big TSO) packet is
not removed from gso_skb.
      
      At 1Gbps speed, a full size TSO packet is 500 us of extra latency.
      
In some cases, we know that all packets dequeued from a qdisc are
for a particular, known txq:
      
      - If device is non multi queue
      - For all MQ/MQPRIO slave qdiscs
      
This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark
this capability, so that dequeue_skb() is allowed to dequeue a packet
only if the associated txq is not stopped.

This indeed reduces latencies for high-priority packets (or improves
fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.
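
A sketch of the dequeue-side check this enables, simplified from the
description (the real logic lives in dequeue_skb()):

        /* Only dequeue when the single, known txq can accept the packet;
         * otherwise leave it in the qdisc instead of requeueing it later. */
        if ((q->flags & TCQ_F_ONETXQUEUE) && netif_xmit_frozen_or_stopped(txq))
                return NULL;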
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      1abbe139
12. 29 Nov 2012, 1 commit
  13. 26 Nov 2012, 1 commit
  14. 22 Nov 2012, 1 commit
  15. 20 Nov 2012, 1 commit
  16. 19 Nov 2012, 1 commit
  17. 08 Nov 2012, 1 commit
  18. 07 Nov 2012, 1 commit
    • htb: fix two bugs · 196d97f6
      Committed by Eric Dumazet
Commit 56b765b7 (htb: improved accuracy at high rates)
introduced two bugs:
      
      1) one bstats_update() was inadvertently removed from
         htb_dequeue_tree(), breaking statistics/rate estimation.
      
2) Missing qdisc_put_rtab() calls in htb_change_class(),
   leaking kernel memory, now that struct htb_class no longer
   retains pointers to qdisc_rate_table structs.

   Since only the rate is used, don't use qdisc_get_rtab() calls
   that copy data we ignore anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vimalkumar <j.vimal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      196d97f6
19. 04 Nov 2012, 1 commit
    • htb: improved accuracy at high rates · 56b765b7
      Committed by Vimalkumar
The current HTB (and TBF) implementation uses a rate table computed by
the "tc" userspace program, which has the following issue:

The rate table has 256 entries to map packet lengths
to tokens (time units).  With TSO-sized packets, the
256-entry granularity leads to loss/gain of rate,
making the token bucket inaccurate.
      
Thus, instead of relying on the rate table, this patch
explicitly computes the time and accounts for packet
transmission times with nanosecond granularity.
      
      This greatly improves accuracy of HTB with a wide
      range of packet sizes.
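
The core of the change can be sketched as a direct length-to-time
computation (simplified; the kernel precomputes a mult/shift pair so no
division is needed per packet):

        #include <stdint.h>

        /* Transmit time of a packet, in nanoseconds, at rate_bps bits/s.
         * A 256-entry table cannot represent this exactly for all lengths. */
        static uint64_t pkt_len_to_ns(uint64_t len_bytes, uint64_t rate_bps)
        {
                return (len_bytes * 8ULL * 1000000000ULL) / rate_bps;
        }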
      
      Example:
      
      tc qdisc add dev $dev root handle 1: \
              htb default 1
      
      tc class add dev $dev classid 1:1 parent 1: \
              rate 5Gbit mtu 64k
      
      Here is an example of inaccuracy:
      
      $ iperf -c host -t 10 -i 1
      
      With old htb:
      eth4:   34.76 Mb/s In  5827.98 Mb/s Out -  65836.0 p/s In  481273.0 p/s Out
      [SUM]  9.0-10.0 sec   669 MBytes  5.61 Gbits/sec
      [SUM]  0.0-10.0 sec  6.50 GBytes  5.58 Gbits/sec
      
      With new htb:
      eth4:   28.36 Mb/s In  5208.06 Mb/s Out -  53704.0 p/s In  430076.0 p/s Out
      [SUM]  9.0-10.0 sec   594 MBytes  4.98 Gbits/sec
      [SUM]  0.0-10.0 sec  5.80 GBytes  4.98 Gbits/sec
      
The bit rate on the wire is still 5200 Mb/s with the new HTB
because the qdisc accounts for packet length using skb->len, which
is smaller than the total bytes on the wire if GSO is used.  But
that is for another patch, regardless of how time is accounted.
      
      Many thanks to Eric Dumazet for review and feedback.
Signed-off-by: Vimalkumar <j.vimal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      56b765b7
20. 26 Oct 2012, 1 commit
    • cgroup: net_cls: Rework update socket logic · 6a328d8c
      Committed by Daniel Wagner
The cgroup logic in net_cls is very similar to the one in
net_prio. Let's streamline the net_cls logic with the net_prio one.

The net_prio update logic was changed by the following commit (note
that some changes were necessary later on):
      
      commit 406a3c63
      Author: John Fastabend <john.r.fastabend@intel.com>
      Date:   Fri Jul 20 10:39:25 2012 +0000
      
          net: netprio_cgroup: rework update socket logic
      
          Instead of updating the sk_cgrp_prioidx struct field on every send
          this only updates the field when a task is moved via cgroup
          infrastructure.
      
          This allows sockets that may be used by a kernel worker thread
          to be managed. For example in the iscsi case today a user can
          put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set, but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value, which is the
    default case.
      
          It seems more correct to only update the field when the user
          explicitly sets it via control group infrastructure. This allows
          the users to manage sockets that may be used with other threads.
      
Since the classid is now updated when the task is moved between
cgroups, we don't have to call sock_update_classid() from various
places to ensure we are always using the latest classid value.
      
      [v2: Use iterate_fd() instead of open coding]
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Li Zefan <lizefan@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <netdev@vger.kernel.org>
      Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      6a328d8c
21. 22 Oct 2012, 1 commit
  22. 28 Sep 2012, 1 commit
    • pkt_sched: Fix warning false positives. · f54ba779
      Committed by David S. Miller
      GCC refuses to recognize that all error control flows do in fact
      set err to something.
      
      Add an explicit initialization to shut it up.
      
      net/sched/sch_drr.c: In function ‘drr_enqueue’:
      net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
      net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
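
The change itself is just an explicit initializer, along these lines
(the exact initial value in the patch is assumed here):

        -	int err;
        +	int err = 0;	/* quiets -Wmaybe-uninitialized */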
Signed-off-by: David S. Miller <davem@davemloft.net>
      f54ba779
23. 25 Sep 2012, 1 commit
    • net: use a per task frag allocator · 5640f768
      Committed by Eric Dumazet
We currently use a per-socket order-0 page cache for tcp_sendmsg()
operations.
      
      This page is used to build fragments for skbs.
      
It's done to increase the probability of coalescing small write()s into
single segments in skbs still in the write queue (not yet sent).

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
page allocator more often than wanted.
      
This patch adds a per-task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, that's order-3 pages on x86)
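
A userspace model of the bigger-pages-with-fallback idea (the kernel
allocates compound pages; malloc() stands in here):

        #include <stdlib.h>
        #include <stddef.h>

        /* Try a 32 KB (order-3 on x86) fragment first, halving the size on
         * failure until a plain 4 KB page-sized chunk is attempted. */
        static void *alloc_frag(size_t *size)
        {
                for (size_t sz = 32768; sz >= 4096; sz >>= 1) {
                        void *p = malloc(sz);
                        if (p) {
                                *size = sz;
                                return p;
                        }
                }
                return NULL;
        }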
      
This increases TCP stream performance by 20% on the loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
      
It's possible some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      5640f768
24. 20 Sep 2012, 1 commit
    • pkt_sched: fix virtual-start-time update in QFQ · 71261956
      Committed by Paolo Valente
If the old timestamps of a class, say cl, are stale when the class
becomes active, then QFQ may assign cl a much higher start time
than the maximum value allowed. This may happen when QFQ assigns to
the start time of cl the finish time of a group whose classes are
characterized by a higher value of the ratio
max_class_pkt/weight_of_the_class than that of cl. Inserting a class
with too high a start time into the bucket list corrupts the data
structure and may eventually lead to crashes.
This patch limits the maximum start time assigned to a class.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
      71261956
25. 15 Sep 2012, 2 commits
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Committed by Tejun Heo
Currently, cgroup hierarchy support is a mess.  The cpu-related
subsystems behave correctly: configuration, accounting and control on
a parent properly cover its children.  blkio and freezer completely
ignore the hierarchy and treat all cgroups as if they were directly
under the root cgroup.  Others show yet different behaviors.
      
These differing interpretations of cgroup hierarchy make using cgroups
confusing and make it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
      
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users relying on separate hierarchies
and expecting completely different behaviors depending on the mounted
subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Committed by Daniel Wagner
WARNING: With this change it is no longer possible to load externally
built controllers.
      
In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
are set, the corresponding subsys_ids should also be constants. Up to
now, net_prio_subsys_id and net_cls_subsys_id were of type int and
their values were assigned at runtime.
      
By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_ids will have constant values. That means
we need to remove all the code which assumes a value can be assigned
to net_prio_subsys_id and net_cls_subsys_id.
      
A close look is necessary at the RCU part, which was introduced by
the following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
  This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
Without an explicit explanation of why the RCU part is needed. (The
rcu_dereference was fixed by changing it to rcu_dereference_index_check()
in a later commit, but that is a minor detail.)
      
So here is my pondering on why it was introduced and why it is safe to
remove it now. Note that this code was copied over to net_prio, so the
reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that only after module is loaded and the
      subsystem registered, the id is assigned.
      
Before unregistering the module, all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronize_rcu(). Any new readers won't call task_subsys_state()
anymore, and therefore it is safe to unregister the subsystem.
      
The new code relies on the same trick, but it looks at the subsys
pointer returned by task_subsys_state() (remember, the id is constant
and therefore we always have a valid index into the subsys
array).
      
No precautions need to be taken during module loading. Eventually,
all CPUs will get a valid pointer back from task_subsys_state(),
because rebind_subsystems(), which is called after the module's init()
function, will assign subsys[net_cls_subsys_id] the newly loaded
module's subsystem pointer.

When the subsystem is about to be removed, rebind_subsystems() will be
called before the module's exit() function. In this case,
rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL
pointer and then call synchronize_rcu(). All old readers will have
left the critical section by then; any new reader won't access the
subsystem anymore.  At this point we are safe to unregister the
subsystem. No additional synchronize_rcu() call is needed.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df