- 09 12月, 2017 2 次提交
-
-
由 John Fastabend 提交于
This converts the pfifo_fast qdisc to use the skb_array data structure and set the lockless qdisc bit. pfifo_fast is the first qdisc to support the lockless bit that can be a child of a qdisc requiring locking. So we add logic to clear the lock bit on initialization in these cases when the qdisc graft operation occurs. This also removes the logic used to pick the next band to dequeue from and instead just checks a per priority array for packets from top priority to lowest. This might need to be a bit more clever but seems to work for now. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Fastabend 提交于
Add qdisc qlen helper routines for lockless qdiscs to use. The qdisc qlen is no longer used in the hotpath but it is reported via stats query on the qdisc so it still needs to be tracked. This adds the per cpu operations needed along with a helper to return the summation of per cpu stats. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2017 2 次提交
-
-
由 Alexander Aring 提交于
This patch fixes the following checkpatch error: ERROR: do not use assignment in if condition by rearranging the if condition to execute init callback only if init callback exists. The whole setup afterwards is called in any case, doesn't matter if init callback is set or not. This patch has the same behaviour as before, just without assign err variable in if condition. It also makes the code easier to read. Reviewed-by: NJamal Hadi Salim <jhs@mojatatu.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: NAlexander Aring <aring@mojatatu.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Aring 提交于
This patch fix checkpatch issues for upcomming patches according to the sched api file. It changes checking on null pointer, remove unnecessary brackets, add variable names for parameters and adjust 80 char width. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: NAlexander Aring <aring@mojatatu.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 10月, 2017 1 次提交
-
-
由 Cong Wang 提交于
Davide found the following script triggers a NULL pointer dereference: ip l a name eth0 type dummy tc q a dev eth0 parent :1 handle 1: htb This is because for a freshly created netdevice noop_qdisc is attached and when passing 'parent :1', kernel actually tries to match the major handle which is 0 and noop_qdisc has handle 0 so is matched by mistake. Commit 69012ae4 tries to fix a similar bug but still misses this case. Handle 0 is not a valid one, should be just skipped. In fact, kernel uses it as TC_H_UNSPEC. Fixes: 69012ae4 ("net: sched: fix handling of singleton qdiscs with qdisc_hash") Fixes: 59cc1f61 ("net: sched:convert qdisc linked list to hashtable") Reported-by: NDavide Caratti <dcaratti@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 10月, 2017 1 次提交
-
-
由 Jiri Pirko 提交于
Use tcf_block_q helper to get q pointer to be used for direct call of sch_tree_lock/unlock instead of tcf_tree_lock/unlock. Signed-off-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 9月, 2017 1 次提交
-
-
由 Colin Ian King 提交于
Pointer tcm is being initialized and is never read, it is only being used to determine the size of struct tcmsg. Clean this up by removing variable tcm and explicitly using the sizeof struct tcmsg rather than *tcm. Cleans up clang warning: warning: Value stored to 'tcm' during its initialization is never read Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 9月, 2017 1 次提交
-
-
由 Cong Wang 提交于
TC filters when used as classifiers are bound to TC classes. However, there is a hidden difference when adding them in different orders: 1. If we add tc classes before its filters, everything is fine. Logically, the classes exist before we specify their ID's in filters, it is easy to bind them together, just as in the current code base. 2. If we add tc filters before the tc classes they bind, we have to do dynamic lookup in fast path. What's worse, this happens all the time not just once, because on fast path tcf_result is passed on stack, there is no way to propagate back to the one in tc filters. This hidden difference hurts performance silently if we have many tc classes in hierarchy. This patch intends to close this gap by doing the reverse binding when we create a new class, in this case we can actually search all the filters in its parent, match and fixup by classid. And because tcf_result is specific to each type of tc filter, we have to introduce a new ops for each filter to tell how to bind the class. Note, we still can NOT totally get rid of those class lookup in ->enqueue() because cgroup and flow filters have no way to determine the classid at setup time, they still have to go through dynamic lookup. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 8月, 2017 3 次提交
-
-
由 WANG Cong 提交于
For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
Like for TC actions, ->delete() is a special case, we have to prepare and fill the notification before delete otherwise would get use-after-free after we remove the reference count. Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
This is not needed if we move them up properly. Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 8月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
syzkaller reported a refcount_t warning [1] Issue here is that noop_qdisc refcnt was never really considered as a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set : if (qdisc->flags & TCQ_F_BUILTIN || !refcount_dec_and_test(&qdisc->refcnt))) return; Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not really needed, but harmless until refcount_t came. To fix this problem, we simply need to not increment noop_qdisc.refcnt, since we never decrement it. [1] refcount_t: increment on 0; use-after-free. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x417 kernel/panic.c:180 __warn+0x1c4/0x1d9 kernel/panic.c:541 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:273 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846 RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152 RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282 RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000 RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8 RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0 R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000 attach_default_qdiscs net/sched/sch_generic.c:792 [inline] dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833 __dev_open+0x227/0x330 net/core/dev.c:1380 __dev_change_flags+0x695/0x990 net/core/dev.c:6726 dev_change_flags+0x88/0x140 net/core/dev.c:6792 dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256 dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554 sock_do_ioctl+0x94/0xb0 net/socket.c:968 sock_ioctl+0x2c2/0x440 net/socket.c:1058 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 Fixes: 7b936405 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Cc: Reshetova, Elena <elena.reshetova@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 8月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
High order GFP_KERNEL allocations can stress the host badly. Use modern kvmalloc_array()/kvfree() instead of custom allocations. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 8月, 2017 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
This callback is used for deactivating class in parent qdisc. This is cheaper to test queue length right here. Also this allows to catch draining screwed backlog and prevent second deactivation of already inactive parent class which will crash kernel for sure. Kernel with print warning at destruction of child qdisc where no packets but backlog is not zero. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 8月, 2017 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
It was added in commit e57a784d ("pkt_sched: set root qdisc before change() in attach_default_qdiscs()") to hide duplicates from "tc qdisc show" for incative deivices. After 59cc1f61 ("net: sched: convert qdisc linked list to hashtable") it triggered when classful qdisc is added to inactive device because default qdiscs are added before switching root qdisc. Anyway after commit ea327469 ("net: sched: avoid duplicates in qdisc dump") duplicates are filtered right in dumper. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 8月, 2017 1 次提交
-
-
由 Florian Westphal 提交于
This change allows us to later indicate to rtnetlink core that certain doit functions should be called without acquiring rtnl_mutex. This change should have no effect, we simply replace the last (now unused) calcit argument with the new flag. Signed-off-by: NFlorian Westphal <fw@strlen.de> Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 7月, 2017 1 次提交
-
-
由 Reshetova, Elena 提交于
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: NElena Reshetova <elena.reshetova@intel.com> Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com> Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NDavid Windsor <dwindsor@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 6月, 2017 1 次提交
-
-
由 Gao Feng 提交于
When qdisc fail to init, qdisc_create would invoke the destroy callback to cleanup. But there is no check if the callback exists really. So it would cause the panic if there is no real destroy callback like the qdisc codel, fq, and so on. Take codel as an example following: When a malicious user constructs one invalid netlink msg, it would cause codel_init->codel_change->nla_parse_nested failed. Then kernel would invoke the destroy callback directly but qdisc codel doesn't define one. It causes one panic as a result. Now add one the check for destroy to avoid the possible panic. Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation") Signed-off-by: NGao Feng <gfree.wind@vip.163.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 5月, 2017 2 次提交
-
-
由 Jiri Pirko 提交于
Currently, the filter chains are direcly put into the private structures of qdiscs. In order to be able to have multiple chains per qdisc and to allow filter chains sharing among qdiscs, there is a need for common object that would hold the chains. This introduces such object and calls it "tcf_block". Helpers to get and put the blocks are provided to be called from individual qdisc code. Also, the original filter_list pointers are left in qdisc privs to allow the entry into tcf_block processing without any added overhead of possible multiple pointer dereference on fast path. Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
Move tc_classify function to cls_api.c where it belongs, rename it to fit the namespace. Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 5月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
In commit 59cc1f61 ("net: sched: convert qdisc linked list to hashtable") we missed the opportunity to considerably speed up tc_dump_tclass_root() if a qdisc handle is provided by user. Instead of iterating all the qdiscs, use qdisc_match_from_root() to directly get the one we look for. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 4月, 2017 2 次提交
-
-
由 David Ahern 提交于
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 stephen hemminger 提交于
Since 3.12 it has been possible to configure the default queuing discipline via sysctl. This patch adds ability to configure the default queue discipline in kernel configuration. This is useful for environments where configuring the value from userspace is difficult to manage. The default is still the same as before (pfifo_fast) and it is possible to change after kernel init with sysctl. This is similar to how TCP congestion control works. Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 4月, 2017 1 次提交
-
-
由 Johannes Berg 提交于
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 3月, 2017 1 次提交
-
-
由 Jiri Kosina 提交于
The original reason [1] for having hidden qdiscs (potential scalability issues in qdisc_match_from_root() with single linked list in case of large amount of qdiscs) has been invalidated by 59cc1f61 ("net: sched: convert qdisc linked list to hashtable"). This allows us for bringing more clarity and determinism into the dump by making default pfifo qdiscs visible. We're not turning this on by default though, at it was deemed [2] too intrusive / unnecessary change of default behavior towards userspace. Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows applications to request complete qdisc hierarchy dump, including the ones that have always been implicit/invisible. Singleton noop_qdisc stays invisible, as teaching the whole infrastructure about singletons would require quite some surgery with very little gain (seeing no qdisc or seeing noop qdisc in the dump is probably setting the same user expectation). [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.netSigned-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 2月, 2017 1 次提交
-
-
由 Gao Feng 提交于
The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab. These two functions are invoked in qdisc_create, qdisc_change, and qdisc_destroy which run fully under RTNL. So it already makes sure only one could access the qdisc_stab_list at the same time. Then it is unnecessary to use qdisc_stab_lock now. Signed-off-by: NGao Feng <fgao@ikuai8.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 2月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
Dmitry reported uses after free in qdisc code [1] The problem here is that ops->init() can return an error. qdisc_create_dflt() then call ops->destroy(), while qdisc_create() does _not_ call it. Four qdisc chose to call their own ops->destroy(), assuming their caller would not. This patch makes sure qdisc_create() calls ops->destroy() and fixes the four qdisc to avoid double free. [1] BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440 Read of size 8 by task syz-executor2/5030 CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400 ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898 ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0 Call Trace: [<ffffffff81bbbed4>] __dump_stack lib/dump_stack.c:15 [inline] [<ffffffff81bbbed4>] dump_stack+0x6c/0x98 lib/dump_stack.c:51 [<ffffffff816682b1>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158 [<ffffffff81668524>] print_address_description mm/kasan/report.c:196 [inline] [<ffffffff81668524>] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285 [<ffffffff81668953>] kasan_report mm/kasan/report.c:305 [inline] [<ffffffff81668953>] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326 [<ffffffff82527b02>] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 [<ffffffff82524bdd>] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953 [<ffffffff82524e30>] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848 [<ffffffff8252550d>] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline] [<ffffffff8252550d>] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064 [<ffffffff824b1db1>] __dev_open+0x221/0x320 net/core/dev.c:1403 [<ffffffff824b24ce>] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858 [<ffffffff824b27de>] dev_change_flags+0x8e/0x140 net/core/dev.c:6926 [<ffffffff824f5bf6>] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260 [<ffffffff824f61fa>] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546 [<ffffffff82430509>] sock_do_ioctl+0x99/0xb0 net/socket.c:879 [<ffffffff82430d30>] sock_ioctl+0x2a0/0x390 net/socket.c:958 [<ffffffff816f3b68>] vfs_ioctl fs/ioctl.c:44 [inline] [<ffffffff816f3b68>] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611 [<ffffffff816f41a4>] SYSC_ioctl fs/ioctl.c:626 [inline] [<ffffffff816f41a4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617 [<ffffffff8123e357>] entry_SYSCALL_64_fastpath+0x12/0x17 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 2月, 2017 2 次提交
-
-
由 Jiri Pirko 提交于
Creation is done in this file, move destruction to be at the same place. Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
This function destroys TC filter protocol, not TC filter. So name it accordingly. Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 1月, 2017 1 次提交
-
-
由 Willem de Bruijn 提交于
This field is no longer kept in tc_verd. Remove it from the global definition of that struct. Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 11月, 2016 1 次提交
-
-
由 Jesper Dangaard Brouer 提交于
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] https://github.com/docker/libcontainer/pull/193 [2] https://github.com/openshift/origin/pull/11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 9月, 2016 1 次提交
-
-
由 Jamal Hadi Salim 提交于
Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 8月, 2016 2 次提交
-
-
由 Jiri Kosina 提交于
tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases; first, the "standard" dev->qdisc is being dumped. Second, if there is/are ingress queue(s), they are being dumped as well. After conversion of netdevice's qdisc linked-list into hashtable, these two sets are not in two disjunctive sets/lists any more, but are both "reachable" directly from netdevice's hashtable. As a consequence, the "full-depth" dump of the ingress qdiscs results in immediately hitting the netdevice hashtable again, and duplicating the dump that has already been performed for dev->qdisc. What in fact needs to be dumped in case of ingress queue is "just" the top-level ingress qdisc, as everything else has been dumped already. Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed whether it should (while performing the "full" per-netdev qdisc dump) perform the whole recursion, or just dump "additional" top-level (ingress) qdiscs without performing any kind of recursion. This fixes duplicate dumps such as qdisc mq 0: root qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc clsact ffff: parent ffff:fff1 qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Fixes: 59cc1f61 ("net: sched: convert qdisc linked list to hashtable") Reported-by: NDaniel Borkmann <daniel@iogearbox.net> Tested-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Kosina 提交于
qdisc_match_from_root() is now iterating over per-netdevice qdisc hashtable instead of going through a linked-list of qdiscs (independently on the actual underlying netdev), which was the case before the switch to hashtable for qdiscs. For singleton qdiscs, there is no underlying netdev associated though, and therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will always be NULL. BUG: unable to handle kernel NULL pointer dereference at 0000000000000410 IP: [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70 PGD 1aceba067 PUD 1aceb7067 PMD 0 Oops: 0000 [#1] PREEMPT SMP [ ... ] task: ffff8801ec996e00 task.stack: ffff8801ec934000 RIP: 0010:[<ffffffff8167efac>] [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70 RSP: 0018:ffff8801ec937ab0 EFLAGS: 00010203 RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8 RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100 RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0 R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000 R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000 FS: 00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0 Stack: ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1 ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500 ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38 Call Trace: [<ffffffff81681210>] qdisc_lookup+0x40/0x50 [<ffffffff81681e4e>] tc_modify_qdisc+0x21e/0x550 [<ffffffff8166ae25>] rtnetlink_rcv_msg+0x95/0x220 [<ffffffff81209602>] ? __kmalloc_track_caller+0x172/0x230 [<ffffffff8166ad90>] ? rtnl_newlink+0x870/0x870 [<ffffffff816897b7>] netlink_rcv_skb+0xa7/0xc0 [<ffffffff816657c8>] rtnetlink_rcv+0x28/0x30 [<ffffffff8168919b>] netlink_unicast+0x15b/0x210 [<ffffffff81689569>] netlink_sendmsg+0x319/0x390 [<ffffffff816379f8>] sock_sendmsg+0x38/0x50 [<ffffffff81638296>] ___sys_sendmsg+0x256/0x260 [<ffffffff811b1275>] ? __pagevec_lru_add_fn+0x135/0x280 [<ffffffff811b1a90>] ? pagevec_lru_move_fn+0xd0/0xf0 [<ffffffff811b1140>] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180 [<ffffffff811b1b85>] ? __lru_cache_add+0x75/0xb0 [<ffffffff817708a6>] ? _raw_spin_unlock+0x16/0x40 [<ffffffff811d8dff>] ? handle_mm_fault+0x39f/0x1160 [<ffffffff81638b15>] __sys_sendmsg+0x45/0x80 [<ffffffff81638b62>] SyS_sendmsg+0x12/0x20 [<ffffffff810038e7>] do_syscall_64+0x57/0xb0 Fix this by special-casing singleton qdiscs (those that don't have underlying netdevice) and introduce immediate handling of those rather than trying to go over an underlying netdevice. We're in the same situation in tc_dump_qdisc_root() and tc_dump_tclass_root(). Ultimately, this will have to be slightly reworked so that we are actually able to show singleton qdiscs (noop) in the dump properly; but we're not currently doing that anyway, so no regression there, and better do this in a gradual manner. Fixes: 59cc1f61 ("net: sched: convert qdisc linked list to hashtable") Reported-by: NDaniel Borkmann <daniel@iogearbox.net> Tested-by: NDaniel Borkmann <daniel@iogearbox.net> Reported-by: NDavid Ahern <dsa@cumulusnetworks.com> Tested-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 8月, 2016 1 次提交
-
-
由 Jiri Kosina 提交于
Convert the per-device linked list into a hashtable. The primary motivation for this change is that currently, we're not tracking all the qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup performed over the linked list by qdisc_match_from_root() is rather expensive. The ultimate goal is to get rid of hidden qdiscs completely, which will bring much more determinism in user experience. Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 6月, 2016 1 次提交
-
-
由 Florian Westphal 提交于
sch_atm returns this when TC_ACT_SHOT classification occurs. But all other schedulers that use tc_classify (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS in this case so just do that in atm. BATMAN uses it as an intermediate return value to signal forwarding vs. buffering, but it did not return POLICED to callers outside of BATMAN. Reviewed-by: NSven Eckelmann <sven@narfation.org> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
__QDISC_STATE_THROTTLED bit manipulation is rather expensive for HTB and few others. I already removed it for sch_fq in commit f2600cf0 ("net: sched: avoid costly atomic operation in fq_dequeue()") and so far nobody complained. When one ore more packets are stuck in one or more throttled HTB class, a htb dequeue() performs two atomic operations to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc lock is held. Removing this pair of atomic operations bring me a 8 % performance increase on 200 TCP_RR tests, in presence of throttled classes. This patch has no side effect, since nothing actually uses disc_is_throttled() anymore. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 6月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host agent [1] are problematic at scale : For each qdisc/class found in the dump, we currently lock the root qdisc spinlock in order to get stats. Sampling stats every 5 seconds from thousands of HTB classes is a challenge when the root qdisc spinlock is under high pressure. Not only the dumps take time, they also slow down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases. An audit of existing qdiscs showed that sch_fq_codel is the only qdisc that might need the qdisc lock in fq_codel_dump_stats() and fq_codel_dump_class_stats() In v2 of this patch, I now use the Qdisc running seqcount to provide consistent reads of packets/bytes counters, regardless of 32/64 bit arches. I also changed rate estimators to use the same infrastructure so that they no longer need to lock root qdisc lock. [1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdfSigned-off-by: NEric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kevin Athey <kda@google.com> Cc: Xiaotian Pei <xiaotian@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 5月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. We constantly rearm high resolution timers if some packets are throttled in one (or more) class, and other packets are flying through qdisc on another (non throttled) class. hrtimer_start() does not have the mod_timer() trick of doing nothing if expires value does not change : if (timer_pending(timer) && timer->expires == expires) return 1; This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base. I used following fix : 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding it. 2) Cache watchdog prior expiration. hrtimer might provide this, but I prefer to not rely on some hrtimer internal. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-