- 30 Mar 2013, 1 commit
-
-
Submitted by Vijay Subramanian

Currently, we hold a maximum of sch->limit - 1 packets instead of sch->limit packets. Fix this off-by-one error. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
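The shape of the fix, as a minimal hedged sketch of a qdisc enqueue check (illustrative only; the field names follow the generic Qdisc API, this is not the literal patch):

    /* Buggy: a queue already holding sch->limit - 1 packets rejects the
     * next one, so sch->limit itself is never reached. */
    if (sch->q.qlen + 1 >= sch->limit)
        return qdisc_drop(skb, sch);

    /* Fixed: admit packets while qlen < limit, so exactly sch->limit
     * packets can be held. */
    if (sch->q.qlen >= sch->limit)
        return qdisc_drop(skb, sch);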
-
- 28 Mar 2013, 1 commit
-
-
Submitted by Sergey Popovich

It seems that commit 292f1c7f ("sch: make htb_rate_cfg and functions around that generic") by Jiri Pirko (Tue Feb 12 00:12:03 2013 +0000) introduced a small regression.

Before:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst 625b
     Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
     rate 0bit 0pps backlog 0b 0p requeues 0
     lended: 0 borrowed: 0 giants: 0
     tokens: 31 ctokens: 31

After:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst 625b
     Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0)
     rate 1976bit 2pps backlog 0b 0p requeues 0
     lended: 41 borrowed: 0 giants: 0
     tokens: 1802 ctokens: 1802

This is probably due to the lost u64 cast of the rate parameter in psched_ratecfg_precompute() (net/sched/sch_generic.c).

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
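The nature of the bug can be seen in a two-line sketch (variable names are illustrative; the actual assignment lives in psched_ratecfg_precompute()):

    u32 rate = 625000000;        /* 5 Gbit/s expressed in bytes per second */

    u64 wrong = rate << 3;       /* shift evaluated in 32 bits, result wraps */
    u64 right = (u64)rate << 3;  /* 5000000000: cast first, no truncation */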
-
- 06 Mar 2013, 6 commits
-
-
Submitted by Paolo Valente
QFQ+ can select for service only 'eligible' aggregates, i.e., aggregates that would have started to be served also in the emulated ideal system. As a consequence, for QFQ+ to be work conserving, at least one of the active aggregates must be eligible when it is time to choose the next aggregate to serve. The set of eligible aggregates is updated through the function qfq_update_eligible(), which does guarantee that, after its invocation, at least one of the active aggregates is eligible. Because of this property, this function is invoked in qfq_deactivate_agg() to guarantee that at least one of the active aggregates is still eligible after an aggregate has been deactivated. In particular, the critical case is when there are other active aggregates, but the aggregate being deactivated happens to be the only one eligible. However, this precaution is not needed for QFQ+ to be work conserving, because update_eligible() is always invoked also at the beginning of qfq_choose_next_agg(). This patch removes the additional invocation of update_eligible() in qfq_deactivate_agg(). Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Paolo Valente
By definition of (the algorithm of) QFQ+, the system virtual time must be pushed up only if there is no 'eligible' aggregate, i.e. no aggregate that would have started to be served also in the ideal system emulated by QFQ+. QFQ+ serves only eligible aggregates, hence the aggregate currently in service is eligible. As a consequence, to decide whether there is no eligible aggregate, QFQ+ must also check whether there is no aggregate in service. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Paolo Valente

Aggregate budgets are computed so as to guarantee that, after an aggregate has been selected for service, that aggregate has enough budget to serve at least one maximum-size packet for the classes it contains. For this reason, after a new aggregate has been selected for service, its next packet is immediately dequeued, without any further control. The maximum packet size for a class, lmax, can be changed through qfq_change_class(). If the user sets lmax to a lower value than the size of some of the still-to-arrive packets, QFQ+ will automatically push up lmax as it enqueues these packets. This automatic push up is likely to happen with TSO/GSO. In any case, if lmax is assigned a lower value than the size of some of the packets already enqueued for the class, then the following problem may occur: the size of the next packet to dequeue for the class may happen to be larger than lmax, after the aggregate to which the class belongs has just been selected for service. In this case, even the budget of the aggregate, which is an unsigned value, may be lower than the size of the next packet to dequeue. After dequeueing this packet and subtracting its size from the budget, the latter would wrap around. This fix prevents the budget from wrapping around after any packet dequeue. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
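A hedged sketch of the guard this implies (aggregate field names approximate the QFQ+ code, not necessarily the exact patch):

    unsigned int len = qdisc_pkt_len(skb);

    /* lmax may have been lowered below the size of a packet that was
     * already enqueued; never let the unsigned budget wrap around. */
    if (unlikely(len > agg->budget))
        agg->budget = 0;
    else
        agg->budget -= len;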
-
Submitted by Paolo Valente

If no aggregate is in service, then the function qfq_dequeue() does not dequeue any packet. For this reason, to guarantee that QFQ+ is work conserving, a just-activated aggregate must be set as in service immediately if it happens to be the only active aggregate. This is done by the function qfq_enqueue(). Unfortunately, the function qfq_add_to_agg(), used to add a class to an aggregate, does not perform this important additional operation. In particular, if: 1) qfq_add_to_agg() is invoked to complete the move of a class from a source aggregate, which becomes inactive as a result of the move, to a destination aggregate, which instead becomes active, and 2) the destination aggregate becomes the only active aggregate, then that aggregate is nevertheless not set as in service. QFQ+ then remains in a non-work-conserving state until a new invocation of qfq_enqueue() recovers the situation. This fix solves the problem by moving the logic for setting an aggregate as in service directly into the function qfq_activate_agg(). Hence, no matter from where qfq_activate_agg() is invoked, QFQ+ remains work conserving. Since the more complex logic of this new version of activate_agg() is not necessary, in qfq_dequeue(), to reschedule an aggregate that finishes its budget, the aggregate is now rescheduled by invoking the needed functions directly. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Paolo Valente
Between two invocations of make_eligible, the system virtual time may happen to grow enough that, in its binary representation, a bit with higher order than 31 flips. This happens especially with TSO/GSO. Before this fix, the mask used in make_eligible was computed as (1UL<<index_of_last_flipped_bit)-1, whose value is well defined on a 64-bit architecture, because index_of_flipped_bit <= 63, but is in general undefined on a 32-bit architecture if index_of_flipped_bit > 31. The fix just replaces 1UL with 1ULL. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
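The one-character nature of the change, as a sketch (the index name is illustrative):

    /* index may exceed 31 once high-order bits of the virtual time flip */
    old_mask = (1UL  << index) - 1;  /* undefined on 32-bit if index > 31 */
    new_mask = (1ULL << index) - 1;  /* well defined for index up to 63 */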
-
Submitted by Paolo Valente

QFQ+ schedules the active aggregates in a group using a bucket list (one list per group). The bucket in which each aggregate is inserted depends on the aggregate's timestamps, and the number of buckets in a group is enough to accommodate the possible (range of) values of the timestamps of all the aggregates in the group. For this property to hold, timestamps must however be computed correctly. One necessary condition for computing timestamps correctly is that the number of bits dequeued for each aggregate, while the aggregate is in service, does not exceed the maximum budget budgetmax assigned to the aggregate. For each aggregate, budgetmax is proportional to the number of classes in the aggregate. If the number of classes of the aggregate is decreased through qfq_change_class(), then budgetmax is decreased automatically as well. Problems may occur if the aggregate is in service when budgetmax is decreased, because the current remaining budget of the aggregate and/or the service already received by the aggregate may happen to be larger than the new value of budgetmax. In this case, when the aggregate is eventually deselected and its timestamps are updated, the aggregate may happen to have received an amount of service larger than budgetmax. This may cause the aggregate to be assigned a higher virtual finish time than the maximum acceptable value for the last bucket in the bucket list of the group. This fix introduces a cap that addresses this issue. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 28 Feb 2013, 1 commit
-
-
Submitted by Sasha Levin

I'm not sure why, but the hlist for-each-entry iterators were conceived differently from the list ones. While the list iterator looks like

    list_for_each_entry(pos, head, member)

the hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only do they not really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; these were modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually.

The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@
    -T b;
    <+... when != b
    (
      hlist_for_each_entry(a, - b, c, d) S
    | hlist_for_each_entry_continue(a, - b, c) S
    | hlist_for_each_entry_from(a, - b, c) S
    | hlist_for_each_entry_rcu(a, - b, c, d) S
    | hlist_for_each_entry_rcu_bh(a, - b, c, d) S
    | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S
    | for_each_busy_worker(a, c, - b, d) S
    | ax25_uid_for_each(a, - b, c) S
    | ax25_for_each(a, - b, c) S
    | inet_bind_bucket_for_each(a, - b, c) S
    | sctp_for_each_hentry(a, - b, c) S
    | sk_for_each(a, - b, c) S
    | sk_for_each_rcu(a, - b, c) S
    | sk_for_each_from -(a, b) +(a) S
      + sk_for_each_from(a) S
    | sk_for_each_safe(a, - b, c, d) S
    | sk_for_each_bound(a, - b, c) S
    | hlist_for_each_entry_safe(a, - b, c, d, e) S
    | hlist_for_each_entry_continue_rcu(a, - b, c) S
    | nr_neigh_for_each(a, - b, c) S
    | nr_neigh_for_each_safe(a, - b, c, d) S
    | nr_node_for_each(a, - b, c) S
    | nr_node_for_each_safe(a, - b, c, d) S
    | - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
    | - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
    | for_each_host(a, - b, c) S
    | for_each_host_safe(a, - b, c, d) S
    | for_each_mesh_entry(a, - b, c, d) S
    )
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]

Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
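For reference, a typical call site before and after the conversion (the struct, list head and member names are made up for illustration):

    struct foo {
        int               val;
        struct hlist_node node;
    };
    struct hlist_head head;
    struct hlist_node *n;   /* the now-redundant cursor */
    struct foo *pos;

    /* before: extra cursor argument */
    hlist_for_each_entry(pos, n, &head, node)
        pr_debug("%d\n", pos->val);

    /* after: looks just like list_for_each_entry() */
    hlist_for_each_entry(pos, &head, node)
        pr_debug("%d\n", pos->val);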
-
- 19 Feb 2013, 2 commits
-
-
Submitted by Gao feng

proc_net_remove() is only used to remove proc entries that live directly under /proc/net; it is not a general function for removing proc entries of a netns. If we want to remove proc entries that live under /proc/net/stat/, we still need to call remove_proc_entry(). This patch uses remove_proc_entry() to replace proc_net_remove(). We can remove proc_net_remove() after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
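A hedged example of the substitution ("psched" is just an example entry name):

    /* before: helper limited to entries directly under /proc/net */
    proc_net_remove(net, "psched");

    /* after: remove_proc_entry() takes an explicit parent, so it also
     * covers entries under e.g. /proc/net/stat/ */
    remove_proc_entry("psched", net->proc_net);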
-
Submitted by Gao feng

Right now, some modules such as bonding use proc_create() to create proc entries under /proc/net/, while other modules such as ipv4 use proc_net_fops_create(). It looks a little chaotic. This patch changes all uses of proc_net_fops_create() to proc_create(). We can remove proc_net_fops_create() after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
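A hedged example of the conversion (entry name, mode and fops are placeholders):

    /* before */
    proc_net_fops_create(net, "psched", 0444, &psched_fops);

    /* after */
    proc_create("psched", 0444, net->proc_net, &psched_fops);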
-
- 16 Feb 2013, 1 commit
-
-
Submitted by Pravin B Shelar

This function will be used in the upcoming GRE_GSO patch. This patch does not change any functionality. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com>
-
- 13 Feb 2013, 9 commits
-
-
Submitted by Jiri Pirko

Current act_police uses a rate table computed by the "tc" userspace program, which has the following issue: the rate table has 256 entries to map packet lengths to tokens (time units). With TSO-sized packets, the 256-entry granularity leads to loss/gain of rate, making the token bucket inaccurate. Thus, instead of relying on the rate table, this patch explicitly computes the time and accounts for packet transmission times with nanosecond granularity. This is a followup to 56b765b7 ("htb: improved accuracy at high rates"). Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
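The idea behind the change, as a hedged sketch (the real code precomputes a multiplier/shift pair; this only shows the nanosecond-accurate length-to-time mapping that replaces the 256-entry table):

    /* transmission time in ns of 'len' bytes at 'rate' bytes per second */
    static u64 length_to_time_ns(u64 rate_bytes_per_sec, unsigned int len)
    {
        return div64_u64((u64)len * NSEC_PER_SEC, rate_bytes_per_sec);
    }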
-
Submitted by Jiri Pirko
It's not used anywhere else, so move it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko

Current TBF uses a rate table computed by the "tc" userspace program, which has the following issue: the rate table has 256 entries to map packet lengths to tokens (time units). With TSO-sized packets, the 256-entry granularity leads to loss/gain of rate, making the token bucket inaccurate. Thus, instead of relying on the rate table, this patch explicitly computes the time and accounts for packet transmission times with nanosecond granularity. This is a followup to 56b765b7 ("htb: improved accuracy at high rates"). Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko
tbf will need to schedule the watchdog in ns. No need to convert it twice. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko
As it is going to be used in tbf as well, push these to generic code. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko
These are in ns so convert from ticks to ns. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko
These are initialized correctly a couple of lines later in the function. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jiri Pirko

In htb_change_class(), cl->buffer and cl->cbuffer are stored in ns. So in the dump, convert them back to psched ticks. Note this was introduced by commit 56b765b7 ("htb: improved accuracy at high rates"). Please consider this for -net/-stable. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
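A hedged sketch of the dump-side conversion (assuming the PSCHED_NS2TICKS()/PSCHED_TICKS2NS() helpers from the accuracy rework; exact usage in htb_dump_class() may differ):

    /* buffer/cbuffer are kept internally in ns since 56b765b7, but user
     * space still expects psched ticks in the netlink dump */
    opt.buffer  = PSCHED_NS2TICKS(cl->buffer);
    opt.cbuffer = PSCHED_NS2TICKS(cl->cbuffer);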
-
- 30 Jan 2013, 1 commit
-
-
Submitted by Johannes Naab

The delay calculation with the rate extension introduced in v3.3 does not work properly if other packets are still queued for transmission.

For the delay calculation to work, both delay types (latency and the delay introduced by rate limitation) have to be handled differently. The latency delay for a packet can overlap with the delay of other packets. The delay introduced by the rate, however, is separate, and can only start once all other rate-introduced delays have finished. The latency delay is drawn from the same distribution for each packet; the rate delay depends on the packet size.

    .: latency delay
    -: rate delay
    x: additional delay we have to wait since another packet is currently transmitted

    .....----                 Packet 1
      .....xx------           Packet 2
                 .....------  Packet 3
    ^^^^^                     latency stacks
         ^^                   rate delay doesn't stack
                 ^^           latency stacks

    -----> time

When a packet is enqueued, we first consider the latency delay. If other packets are already queued, we can reduce the latency delay until the last packet in the queue is sent; however, the latency delay cannot be < 0, since this would mean that the rate is overcommitted. The new reference point is the time at which the last packet will be sent. To find the time when the packet should be sent, the rate-introduced delay has to be added on top of that.

Signed-off-by: Johannes Naab <jn@stusta.de>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
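A hedged sketch of the resulting rule (all names are illustrative; this is not the netem source):

    /* Time at which a newly enqueued packet should be sent. */
    static u64 packet_send_time_ns(u64 now, u64 latency_delay, u64 rate_delay,
                                   u64 last_queued_packet_leaves)
    {
        /* the latency delay may be absorbed by time spent waiting behind
         * already-queued packets, but never below the point where the last
         * queued packet leaves (otherwise the rate would be overcommitted) */
        u64 start = max(now + latency_delay, last_queued_packet_leaves);

        /* the rate-dependent delay only starts once the link is free */
        return start + rate_delay;
    }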
-
- 15 Jan 2013, 1 commit
-
-
Submitted by Benjamin LaHaise
Eric Dumazet pointed out that act_mirred needs to find the current net_ns, and struct net pointer is not provided in the call chain. His original patch made use of current->nsproxy->net_ns to find the network namespace, but this fails to work correctly for userspace code that makes use of netlink sockets in different network namespaces. Instead, pass the "struct net *" down along the call chain to where it is needed. This version removes the ifb changes as Eric has submitted that patch separately, but is otherwise identical to the previous version. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Dec 2012, 1 commit
-
-
Submitted by Stefan Hasko
Fixed integer overflow in function htb_dequeue. Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 12 Dec 2012, 1 commit
-
-
Submitted by Eric Dumazet

With BQL being deployed, we are more likely to see the following behavior: we dequeue a packet from the qdisc in dequeue_skb(), then realize the target tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the skb in gso_skb for later. This shows up in stats (tc -s qdisc dev eth0) as requeues.

The problem with these requeues is that high-priority packets cannot be dequeued as long as this (possibly low-prio and big TSO) packet is not removed from gso_skb. At 1Gbps speed, a full-size TSO packet is 500 us of extra latency.

In some cases, we know that all packets dequeued from a qdisc are for a particular and known txq:
- if the device is non multiqueue
- for all MQ/MQPRIO slave qdiscs

This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark this capability, so that dequeue_skb() is allowed to dequeue a packet only if the associated txq is not stopped. This indeed reduces latencies for high-prio packets (or improves fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.

Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
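A hedged, simplified sketch of the dequeue_skb() side of this (the real function also handles the requeued gso_skb case):

    static struct sk_buff *toy_dequeue(struct Qdisc *q, struct netdev_queue *txq)
    {
        struct sk_buff *skb = NULL;

        /* with TCQ_F_ONETXQUEUE every packet of this qdisc goes to 'txq',
         * so don't pull a new packet out while that queue is stopped */
        if (!(q->flags & TCQ_F_ONETXQUEUE) ||
            !netif_xmit_frozen_or_stopped(txq))
            skb = q->dequeue(q);

        return skb;
    }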
-
- 29 Nov 2012, 1 commit
-
-
Submitted by Paolo Valente

This patch turns QFQ into QFQ+, a variant of QFQ that provides the following two benefits: 1) QFQ+ is faster than QFQ, 2) differently from QFQ, QFQ+ correctly schedules also non-leaf classes in a hierarchical setting. A detailed description of QFQ+, plus a performance comparison with DRR and QFQ, can be found in [1].

[1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers", http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 Nov 2012, 1 commit
-
-
Submitted by Marc Kleine-Budde

This patch makes it possible to build the CAN identifier match into the kernel, even if CAN support is built as a module. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Nov 2012, 1 commit
-
-
Submitted by Tejun Heo
It turns out that we'll have to live with attributes which are inherited at cgroup creation time but not affected by further updates to the parent afterwards - such attributes are already in wide use e.g. for cpuset. So, there's nothing to do for netcls_cgroup for hierarchy support. Its current behavior - inherit only during creation - is good enough. Move config inheriting from ->css_alloc() to ->css_online() for consistency, which doesn't change behavior at all, and remove the .broken_hierarchy marking. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>
-
- 20 Nov 2012, 1 commit
-
-
Submitted by Tejun Heo
Rename the cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
-
- 19 Nov 2012, 1 commit
-
-
Submitted by Eric W. Biederman

- In rtnetlink_rcv_msg, convert the capable(CAP_NET_ADMIN) check to ns_capable(net->user_ns, CAP_NET_ADMIN), allowing unprivileged users to make netlink calls to modify their local network namespace.
- In the rtnetlink doit methods, add capable(CAP_NET_ADMIN) so that calls that are not safe for unprivileged users are still protected.

Later patches will remove the extra capable calls from methods that are safe for unprivileged users.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
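A hedged sketch of the entry check described in the first point (not the literal rtnetlink_rcv_msg() code):

    static int toy_rtnl_perm_check(struct sk_buff *skb)
    {
        struct net *net = sock_net(skb->sk);

        /* CAP_NET_ADMIN in the netns' own user namespace is now enough for
         * the generic entry check; doit methods that are not yet safe for
         * unprivileged users keep an additional capable(CAP_NET_ADMIN). */
        if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
            return -EPERM;

        return 0;
    }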
-
- 08 Nov 2012, 1 commit
-
-
Submitted by Paolo Valente

If the max packet size for some class (configured through tc) is violated by the actual size of the packets of that class, then QFQ would not schedule classes correctly, and the data structures implementing the bucket lists may get corrupted. This problem occurs with TSO/GSO even if the max packet size is set to the MTU, and is, e.g., the cause of the failure reported in [1]. Two patches have been proposed to solve this problem in [2], one of them being a preliminary version of this patch.

This patch addresses the above issues by: 1) setting QFQ parameters to proper values for supporting TSO/GSO (in particular, setting the maximum possible packet size to 64KB), 2) automatically increasing the max packet size for a class, lmax, when a packet with a larger size than the current value of lmax arrives. The drawback of the first point is that the maximum weight for a class is now limited to 4096, which is equal to 1/16 of the maximum weight sum.

Finally, this patch also forcibly caps the timestamps of a class if they are too high to be stored in the bucket list. This capping, taken from QFQ+ [3], handles the infrequent case described in the comment to the function slot_insert.

[1] http://marc.info/?l=linux-netdev&m=134968777902077&w=2
[2] http://marc.info/?l=linux-netdev&m=135096573507936&w=2
[3] http://marc.info/?l=linux-netdev&m=134902691421670&w=2

Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Tested-by: Cong Wang <amwang@redhat.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 07 Nov 2012, 1 commit
-
-
Submitted by Eric Dumazet

Commit 56b765b7 ("htb: improved accuracy at high rates") introduced two bugs: 1) one bstats_update() call was inadvertently removed from htb_dequeue_tree(), breaking statistics/rate estimation. 2) Missing qdisc_put_rtab() calls in htb_change_class() were leaking kernel memory, now that struct htb_class no longer retains pointers to qdisc_rate_table structs. Since only the rate is used, don't use qdisc_get_rtab() calls copying data we ignore anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vimalkumar <j.vimal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 04 Nov 2012, 1 commit
-
-
Submitted by Vimalkumar

Current HTB (and TBF) uses a rate table computed by the "tc" userspace program, which has the following issue: the rate table has 256 entries to map packet lengths to tokens (time units). With TSO-sized packets, the 256-entry granularity leads to loss/gain of rate, making the token bucket inaccurate.

Thus, instead of relying on the rate table, this patch explicitly computes the time and accounts for packet transmission times with nanosecond granularity. This greatly improves the accuracy of HTB with a wide range of packet sizes.

Example:

    tc qdisc add dev $dev root handle 1: htb default 1
    tc class add dev $dev classid 1:1 parent 1: rate 5Gbit mtu 64k

Here is an example of the inaccuracy:

    $ iperf -c host -t 10 -i 1

With old htb:

    eth4: 34.76 Mb/s In 5827.98 Mb/s Out - 65836.0 p/s In 481273.0 p/s Out
    [SUM] 9.0-10.0 sec 669 MBytes 5.61 Gbits/sec
    [SUM] 0.0-10.0 sec 6.50 GBytes 5.58 Gbits/sec

With new htb:

    eth4: 28.36 Mb/s In 5208.06 Mb/s Out - 53704.0 p/s In 430076.0 p/s Out
    [SUM] 9.0-10.0 sec 594 MBytes 4.98 Gbits/sec
    [SUM] 0.0-10.0 sec 5.80 GBytes 4.98 Gbits/sec

The bits per second on the wire is still 5200Mb/s with new HTB because the qdisc accounts for packet length using skb->len, which is smaller than the total bytes on the wire if GSO is used. But that is for another patch, regardless of how time is accounted.

Many thanks to Eric Dumazet for review and feedback.

Signed-off-by: Vimalkumar <j.vimal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 Oct 2012, 1 commit
-
-
Submitted by Daniel Wagner

The cgroup logic part of net_cls is very similar to the one in net_prio. Let's streamline the net_cls logic with the net_prio one.

The net_prio update logic was changed by the following commit (note some further changes were necessary later on):

    commit 406a3c63
    Author: John Fastabend <john.r.fastabend@intel.com>
    Date:   Fri Jul 20 10:39:25 2012 +0000

    net: netprio_cgroup: rework update socket logic

    Instead of updating the sk_cgrp_prioidx struct field on every send, this
    only updates the field when a task is moved via cgroup infrastructure.
    This allows sockets that may be used by a kernel worker thread to be
    managed. For example, in the iscsi case today a user can put iscsid in a
    netprio cgroup and control traffic will be sent with the correct
    sk_cgrp_prioidx value set, but as soon as data is sent the kernel worker
    thread issues a send and sk_cgrp_prioidx is updated with the kernel
    worker thread's value, which is the default case.

    It seems more correct to only update the field when the user explicitly
    sets it via control group infrastructure. This allows the users to
    manage sockets that may be used with other threads.

Since classid is now updated when the task is moved between cgroups, we don't have to call sock_update_classid() from various places to ensure we are always using the latest classid value.

[v2: Use iterate_fd() instead of open coding]

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Li Zefan <lizefan@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Joe Perches <joe@perches.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Oct 2012, 1 commit
-
-
Submitted by Eric Dumazet
ns_to_ktime() seems better than ktime_set() + ktime_add_ns(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
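The simplification in one line (delay_ns is a placeholder):

    ktime_t t_old = ktime_add_ns(ktime_set(0, 0), delay_ns); /* before */
    ktime_t t_new = ns_to_ktime(delay_ns);                   /* after, same value */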
-
- 28 Sep 2012, 1 commit
-
-
Submitted by David S. Miller
GCC refuses to recognize that all error control flows do in fact set err to something. Add an explicit initialization to shut it up.

    net/sched/sch_drr.c: In function ‘drr_enqueue’:
    net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
    net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 25 Sep 2012, 1 commit
-
-
Submitted by Eric Dumazet

We currently use a per-socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. It's done to increase the probability of coalescing small write() calls into single segments in skbs still in the write queue (not yet sent). But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page. It's also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the page allocator more than wanted. This patch adds a per-task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (Up to 32768 bytes per frag, that's order-3 pages on x86.) This increases TCP stream performance by 20% on the loopback device, but also benefits other network devices, since 8x fewer frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. It's possible some SG-enabled hardware can't cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment into sub-fragments, since some arches have PAGE_SIZE=65536. Successfully tested on various ethernet devices (ixgbe, igb, bnx2x, tg3, mellanox mlx4). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 Sep 2012, 1 commit
-
-
Submitted by Paolo Valente
If the old timestamps of a class, say cl, are stale when the class becomes active, then QFQ may assign to cl a much higher start time than the maximum value allowed. This may happen when QFQ assigns to the start time of cl the finish time of a group whose classes are characterized by a higher value of the ratio max_class_pkt/weight_of_the_class with respect to that of cl. Inserting a class with a too high start time into the bucket list corrupts the data structure and may eventually lead to crashes. This patch limits the maximum start time assigned to a class. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 15 Sep 2012, 2 commits
-
-
Submitted by Tejun Heo

Currently, cgroup hierarchy support is a mess. cpu-related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore the hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of the cgroup hierarchy make using cgroup confusing and make it impossible to co-mount controllers into the same hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is detrimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold:

* Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them.
* Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support.

For now, start with a single warning message. We can whine louder later on.

v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated the memcg part so that it doesn't generate a warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result, per Michal. Dropped unnecessary memcg root handling, per Michal.

Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
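A hedged example of how a controller opts in to the new warning (other fields elided; net_cls is one of the subsystems marked here, and the netcls commit listed above later removes the marking again):

    struct cgroup_subsys net_cls_subsys = {
        .name             = "net_cls",
        /* ... */
        /* cgroup core warns once when a nested cgroup is created under a
         * controller that does not really support hierarchy */
        .broken_hierarchy = true,
    };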
-
Submitted by Daniel Wagner

WARNING: With this change it is impossible to load externally built controllers anymore.

In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is set, the corresponding subsys_id should also be a constant. Up to now, net_prio_subsys_id and net_cls_subsys_id were of type int and their values were assigned at runtime. By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN to IS_ENABLED, all *_subsys_id will have a constant value. That means we need to remove all the code which assumes a value can be assigned to net_prio_subsys_id and net_cls_subsys_id.

A close look is necessary at the RCU part, which was introduced by the following patch:

    commit f8451725
    Author: Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
    Committer: David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

This code was added to init_cgroup_cls():

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

respectively to exit_cgroup_cls():

    net_cls_subsys_id = -1;
    synchronize_rcu();

and in the module version of task_cls_classid():

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
        classid = container_of(task_subsys_state(p, id),
                               struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

without an explicit explanation why the RCU part is needed. (The rcu_dereference was fixed by exchanging it for rcu_dereference_index_check() in a later commit, but that is a minor detail.)

So here is my pondering on why it was introduced and why it is safe to remove it now. Note that this code was copied over to net_prio; the reasoning holds for that subsystem too.

The idea behind the RCU use for net_cls_subsys_id is to make sure we get a valid pointer back from task_subsys_state(). task_subsys_state() is just blindly accessing the subsys array and returning the pointer. Obviously, passing in -1 as id into task_subsys_state() returns an invalid value (out of lower bound).

So this code makes sure that the id is assigned only after the module is loaded and the subsystem registered. Before unregistering the module, all old readers must have left the critical section. This is done by assigning -1 to the id and issuing a synchronize_rcu(). Any new readers won't call task_subsys_state() anymore and therefore it is safe to unregister the subsystem.

The new code relies on the same trick, but it looks at the subsys pointer returned by task_subsys_state() (remember the id is constant and therefore we always have a valid index into the subsys array). No precautions need to be taken during module loading. Eventually, all CPUs will get a valid pointer back from task_subsys_state() because rebind_subsystems(), which is called after the module init() function, will assign subsys[net_cls_subsys_id] the newly loaded module subsystem pointer.

When the subsystem is about to be removed, rebind_subsystems() will be called before the module exit() function. In this case, rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL pointer and then call synchronize_rcu(). All old readers have left the critical section by then. Any new reader won't access the subsystem anymore. At this point we are safe to unregister the subsystem. No synchronize_rcu() call is needed.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org
-