- 01 9月, 2017 1 次提交
-
-
由 Cong Wang 提交于
TC filters when used as classifiers are bound to TC classes. However, there is a hidden difference when adding them in different orders: 1. If we add tc classes before its filters, everything is fine. Logically, the classes exist before we specify their ID's in filters, it is easy to bind them together, just as in the current code base. 2. If we add tc filters before the tc classes they bind, we have to do dynamic lookup in fast path. What's worse, this happens all the time not just once, because on fast path tcf_result is passed on stack, there is no way to propagate back to the one in tc filters. This hidden difference hurts performance silently if we have many tc classes in hierarchy. This patch intends to close this gap by doing the reverse binding when we create a new class, in this case we can actually search all the filters in its parent, match and fixup by classid. And because tcf_result is specific to each type of tc filter, we have to introduce a new ops for each filter to tell how to bind the class. Note, we still can NOT totally get rid of those class lookup in ->enqueue() because cgroup and flow filters have no way to determine the classid at setup time, they still have to go through dynamic lookup. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 8月, 2017 6 次提交
-
-
由 David Ahern 提交于
IPv4 name uses "destination ip" as does the IPv6 patch set. Make the mac field consistent. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ahmed Abdelsalam 提交于
IPv6 packet may carry more than one extension header, and IPv6 nodes must accept and attempt to process extension headers in any order and occurring any number of times in the same packet. Hence, there should be no assumption that Segment Routing extension header is to appear immediately after the IPv6 header. Moreover, section 4.1 of RFC 8200 gives a recommendation on the order of appearance of those extension headers within an IPv6 packet. According to this recommendation, Segment Routing extension header should appear after Hop-by-Hop and Destination Options headers (if they present). This patch fixes the get_srh(), so it gets the segment routing header regardless of its position in the chain of the extension headers in IPv6 packet, and makes sure that the IPv6 routing extension header is of Type 4. Signed-off-by: NAhmed Abdelsalam <amsalam20@gmail.com> Acked-by: NDavid Lebrun <david.lebrun@uclouvain.be> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Mi 提交于
Typically, each TC filter has its own action. All the actions of the same type are saved in its hash table. But the hash buckets are too small that it degrades to a list. And the performance is greatly affected. For example, it takes about 0m11.914s to insert 64K rules. If we convert the hash table to IDR, it only takes about 0m1.500s. The improvement is huge. But please note that the test result is based on previous patch that cls_flower uses IDR. Signed-off-by: NChris Mi <chrism@mellanox.com> Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Mi 提交于
Currently, all filters with the same priority are linked in a doubly linked list. Every filter should have a unique handle. To make the handle unique, we need to iterate the list every time to see if the handle exists or not when inserting a new filter. It is time-consuming. For example, it takes about 5m3.169s to insert 64K rules. This patch changes cls_flower to use IDR. With this patch, it takes about 0m1.127s to insert 64K rules. The improvement is huge. But please note that in this testing, all filters share the same action. If every filter has a unique action, that is another bottleneck. Follow-up patch in this patchset addresses that. Signed-off-by: NChris Mi <chrism@mellanox.com> Signed-off-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
This reverts commit 45f119bf. Eric Dumazet says: We found at Google a significant regression caused by 45f119bf tcp: remove header prediction In typical RPC (TCP_RR), when a TCP socket receives data, we now call tcp_ack() while we used to not call it. This touches enough cache lines to cause a slowdown. so problem does not seem to be HP removal itself but the tcp_ack() call. Therefore, it might be possible to remove HP after all, provided one finds a way to elide tcp_ack for most cases. Reported-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
This change was a followup to the header prediction removal, so first revert this as a prerequisite to back out hp removal. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 8月, 2017 10 次提交
-
-
由 Eric Dumazet 提交于
Florian reported UDP xmit drops that could be root caused to the too small neigh limit. Current limit is 64 KB, meaning that even a single UDP socket would hit it, since its default sk_sndbuf comes from net.core.wmem_default (~212992 bytes on 64bit arches). Once ARP/ND resolution is in progress, we should allow a little more packets to be queued, at least for one producer. Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf limit and either block in sendmsg() or return -EAGAIN. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Tariq repored local pings to linklocal address is failing: $ ifconfig ens8 ens8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 11.141.16.6 netmask 255.255.0.0 broadcast 11.141.255.255 inet6 fe80::7efe:90ff:fecb:7502 prefixlen 64 scopeid 0x20<link> ether 7c:fe:90:cb:75:02 txqueuelen 1000 (Ethernet) RX packets 12 bytes 1164 (1.1 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 30 bytes 2484 (2.4 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 $ /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8 PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Add a new nsh/ directory. It currently holds only GSO functions but more will come: in particular, code shared by openvswitch and tc to manipulate NSH headers. For now, assume there's no hardware support for NSH segmentation. We can always introduce netdev->nsh_features later. Signed-off-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Aring 提交于
This patch handles a default IFE type if it's not given by user space netlink api. The default IFE type will be the registered ethertype by IEEE for IFE ForCES. Signed-off-by: NAlexander Aring <aring@mojatatu.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Roopa Prabhu 提交于
A few useful tracepoints to trace bridge forwarding database updates. Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
Creating as specific xdp_redirect_map variant of the xdp tracepoints allow users to write simpler/faster BPF progs that get attached to these tracepoints. Goal is to still keep the tracepoints in xdp_redirect and xdp_redirect_map similar enough, that a tool can read the top part of the TP_STRUCT and produce similar monitor statistics. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
There is a need to separate the xdp_redirect tracepoint into two tracepoints, for separating the error case from the normal forward case. Due to the extreme speeds XDP is operating at, loading a tracepoint have a measurable impact. Single core XDP REDIRECT (ethtool tuned rx-usecs 25) can do 13.7 Mpps forwarding, but loading a simple bpf_prog at the tracepoint (with a return 0) reduce perf to 10.2 Mpps (CPU E5-1650 v4 @ 3.60GHz, driver: ixgbe) The overhead of loading a bpf-based tracepoint can be calculated to cost 25 nanosec ((1/13782002-1/10267937)*10^9 = -24.83 ns). Using perf record on the tracepoint event, with a non-matching --filter expression, the overhead is much larger. Performance drops to 8.3 Mpps, cost 48 nanosec ((1/13782002-1/8312497)*10^9 = -47.74)) Having a separate tracepoint for err cases, which should be less frequent, allow running a continuous monitor for errors while not affecting the redirect forward performance (this have also been verified by measurements). Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
To make sense of the map index, the tracepoint user also need to know that map we are talking about. Supply the map pointer but only expose the map->id. The 'to_index' is renamed 'to_ifindex'. In the xdp_redirect_map case, this is the result of the devmap lookup. The map lookup key is exposed as map_index, which is needed to troubleshoot in case the lookup failed. The 'to_ifindex' is placed after 'err' to keep TP_STRUCT as common as possible. This also keeps the TP_STRUCT similar enough, that userspace can write a monitor program, that doesn't need to care about whether bpf_redirect or bpf_redirect_map were used. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
Supplying the action argument XDP_REDIRECT to the tracepoint xdp_redirect is redundant as it is only called in-case this action was specified. Remove the argument, but keep "act" member of the tracepoint struct and populate it with XDP_REDIRECT. This makes it easier to write a common bpf_prog processing events. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
There appears to be no need to use rtnl, addrlabel entries are refcounted and add/delete is serialized by the addrlabel table spinlock. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 8月, 2017 20 次提交
-
-
由 David Howells 提交于
Allow a client call that failed on network error to be retried, provided that the Tx queue still holds DATA packet 1. This allows an operation to be submitted to another server or another address for the same server without having to repackage and re-encrypt the data so far processed. Two new functions are provided: (1) rxrpc_kernel_check_call() - This is used to find out the completion state of a call to guess whether it can be retried and whether it should be retried. (2) rxrpc_kernel_retry_call() - Disconnect the call from its current connection, reset the state and submit it as a new client call to a new address. The new address need not match the previous address. A call may be retried even if all the data hasn't been loaded into it yet; a partially constructed will be retained at the same point it was at when an error condition was detected. msg_data_left() can be used to find out how much data was packaged before the error occurred. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Add a callback to rxrpc_kernel_send_data() so that a kernel service can get a notification that the AF_RXRPC call has transitioned out the Tx phase and is now waiting for a reply or a final ACK. This is called from AF_RXRPC with the call state lock held so the notification is guaranteed to come before any reply is passed back. Further, modify the AFS filesystem to make use of this so that we don't have to change the afs_call state before sending the last bit of data. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Remove indentation from some blank lines. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
call->error is stored as 0 or a negative error code. Don't negate this value (ie. make it positive) before returning it from a kernel function (though it should still be negated before passing to userspace through a control message). Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Fix IPv6 support in AF_RXRPC in the following ways: (1) When extracting the address from a received IPv4 packet, if the local transport socket is open for IPv6 then fill out the sockaddr_rxrpc struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead of an AF_INET one. (2) When sending CHALLENGE or RESPONSE packets, the transport length needs to be set from the sockaddr_rxrpc::transport_len field rather than sizeof() on the IPv4 transport address. (3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up the address correctly before searching for the affected peer. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
When an XDR-encoded Kerberos 5 ticket is added as an rxrpc-type key, the expiry time should be drawn from the k5 part of the token union (which was what was filled in), rather than the kad part of the union. Reported-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 Baolin Wang 提交于
Since the 'expiry' variable of 'struct key_preparsed_payload' has been changed to 'time64_t' type, which is year 2038 safe on 32bits system. In net/rxrpc subsystem, we need convert 'u32' type to 'time64_t' type when copying ticket expires time to 'prep->expiry', then this patch introduces two helper functions to help convert 'u32' to 'time64_t' type. This patch also uses ktime_get_real_seconds() to get current time instead of get_seconds() which is not year 2038 safe on 32bits system. Signed-off-by: NBaolin Wang <baolin.wang@linaro.org> Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 Samuel Mendoza-Jonas 提交于
Make use of the ndo_vlan_rx_{add,kill}_vid callbacks to have the NCSI stack process new VLAN tags and configure the channel VLAN filter appropriately. Several VLAN tags can be set and a "Set VLAN Filter" packet must be sent for each one, meaning the ncsi_dev_state_config_svf state must be repeated. An internal list of VLAN tags is maintained, and compared against the current channel's ncsi_channel_filter in order to keep track within the state. VLAN filters are removed in a similar manner, with the introduction of the ncsi_dev_state_config_clear_vids state. The maximum number of VLAN tag filters is determined by the "Get Capabilities" response from the channel. Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Samuel Mendoza-Jonas 提交于
Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Greg Kroah-Hartman 提交于
It's time to get rid of IRDA. It's long been broken, and no one seems to use it anymore. So move it to staging and after a while, we can delete it from there. To start, move the network irda core from net/irda to drivers/staging/irda/net/ Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Gao Feng 提交于
The commit 520ac30f ("net_sched: drop packets after root qdisc lock is released) made a big change of tc for performance. But there are some points which are not changed in SFQ enqueue operation. 1. Fail to find the SFQ hash slot; 2. When the queue is full; Now use qdisc_drop instead free skb directly. Signed-off-by: NGao Feng <gfree.wind@vip.163.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Dexuan Cui 提交于
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication mechanism between the host and the guest. It uses VMBus ringbuffer as the transportation layer. With hv_sock, applications between the host (Windows 10, Windows Server 2016 or newer) and the guest can talk with each other using the traditional socket APIs. More info about Hyper-V Sockets is available here: "Make your own integration services": https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/make-integration-service The patch implements the necessary support in Linux guest by introducing a new vsock transport for AF_VSOCK. Signed-off-by: NDexuan Cui <decui@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy King <acking@vmware.com> Cc: Dmitry Torokhov <dtor@vmware.com> Cc: George Zhang <georgezhang@vmware.com> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Reilly Grant <grantr@vmware.com> Cc: Asias He <asias@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Cathy Avery <cavery@redhat.com> Cc: Rolf Neugebauer <rolf.neugebauer@docker.com> Cc: Marcelo Cerri <marcelo.cerri@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Twice patches trying to constify inet{6}_protocol have been reverted: 39294c3d ("Revert "ipv6: constify inet6_protocol structures"") to revert 3a3a4e30 and then 03157937 ("Revert "ipv4: make net_protocol const"") to revert aa8db499. Add a comment that the structures can not be const because the early_demux field can change based on a sysctl. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 William Tu 提交于
Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. OVS can use it as well in the future. Signed-off-by: NWilliam Tu <u9012063@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 William Tu 提交于
The patch refactors the gre_fb_xmit function, by creating prepare_fb_xmit function for later ERSPAN collect_md mode patch. Signed-off-by: NWilliam Tu <u9012063@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
This reverts commit aa8db499. Early demux structs can not be made const. Doing so results in: [ 84.967355] BUG: unable to handle kernel paging request at ffffffff81684b10 [ 84.969272] IP: proc_configure_early_demux+0x1e/0x3d [ 84.970544] PGD 1a0a067 [ 84.970546] P4D 1a0a067 [ 84.971212] PUD 1a0b063 [ 84.971733] PMD 80000000016001e1 [ 84.972669] Oops: 0003 [#1] SMP [ 84.973065] Modules linked in: ip6table_filter ip6_tables veth vrf [ 84.973833] CPU: 0 PID: 955 Comm: sysctl Not tainted 4.13.0-rc6+ #22 [ 84.974612] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 84.975855] task: ffff88003854ce00 task.stack: ffffc900005a4000 [ 84.976580] RIP: 0010:proc_configure_early_demux+0x1e/0x3d [ 84.977253] RSP: 0018:ffffc900005a7dd0 EFLAGS: 00010246 [ 84.977891] RAX: ffffffff81684b10 RBX: 0000000000000001 RCX: 0000000000000000 [ 84.978759] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000 [ 84.979628] RBP: ffffc900005a7dd0 R08: 0000000000000000 R09: 0000000000000000 [ 84.980501] R10: 0000000000000001 R11: 0000000000000008 R12: 0000000000000001 [ 84.981373] R13: ffffffffffffffea R14: ffffffff81a9b4c0 R15: 0000000000000002 [ 84.982249] FS: 00007feb237b7700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 [ 84.983231] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 84.983941] CR2: ffffffff81684b10 CR3: 0000000038492000 CR4: 00000000000406f0 [ 84.984817] Call Trace: [ 84.985133] proc_tcp_early_demux+0x29/0x30 I think this is the second time such a patch has been reverted. Cc: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bhumika Goyal 提交于
Make this const as it is either used during a copy operation or passed to a const argument of the function rhltable_init Signed-off-by: NBhumika Goyal <bhumirks@gmail.com> Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bhumika Goyal 提交于
Make these const as they are only passed to a const argument of the function inet_add_protocol. Signed-off-by: NBhumika Goyal <bhumirks@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bhumika Goyal 提交于
Make this const as it is only passed to a const argument of the function ebt_register_table. Signed-off-by: NBhumika Goyal <bhumirks@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Fastabend 提交于
SOCKMAP uses strparser code (compiled with Kconfig option CONFIG_STREAM_PARSER) to run the parser BPF program. Without this config option set sockmap wont be compiled. However, at the moment the only way to pull in the strparser code is to enable KCM. To resolve this create a BPF specific config option to pull only the strparser piece in that sockmap needs. This also allows folks who want to use BPF/syscall/maps but don't need sockmap to easily opt out. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 8月, 2017 3 次提交
-
-
由 Eric Dumazet 提交于
syszkaller got a hang in tcp stack, related to a bug in tcp_sendpage_locked() root@syzkaller:~# cat /proc/3059/stack [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0 [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110 [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50 [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0 [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110 [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60 [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280 [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160 [<ffffffff84089203>] tcp_sendpage+0x43/0x60 [<ffffffff841641da>] inet_sendpage+0x1aa/0x660 [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0 [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0 [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0 [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750 [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330 [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50 [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610 [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: 306b13eb ("proto_ops: Add locked held versions of sendmsg and sendpage") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
It is ugly to hide a u32-filter-specific pointer inside Qdisc, this breaks the TC layers: 1. Qdisc is a generic representation, should not have any specific data of any type 2. Qdisc layer is above filter layer, should only save filters in the list of struct tcf_proto. This pointer is used as the head of the chain of u32 hash tables, that is struct tc_u_hnode, because u32 filter is very special, it allows to create multiple hash tables within one qdisc and across multiple u32 filters. Instead of using this ugly pointer, we can just save it in a global hash table key'ed by (dev ifindex, qdisc handle), therefore we can still treat it as a per qdisc basis data structure conceptually. Of course, because of network namespaces, this key is not unique at all, but it is fine as we already have a pointer to Qdisc in struct tc_u_common, we can just compare the pointers when collision. And this only affects slow paths, has no impact to fast path, thanks to the pointer ->tp_c. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-