- 06 12月, 2018 4 次提交
-
-
由 Nikolay Aleksandrov 提交于
bridge's default hash_max was 512 which is rather conservative, now that we're using the generic rhashtable API which autoshrinks let's increase it to 4096 and move it to a define in br_private.h. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Now that the bridge multicast uses the generic rhashtable interface we can drop the hash_elasticity option as that is already done for us and it's hardcoded to a maximum of RHT_ELASTICITY (16 currently). Add a warning about the obsolete option when the hash_elasticity is set. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
The bridge multicast code has been using a mix of RCU and RCU-bh flavors sometimes in questionable way. Since we've moved to rhashtable just use non-bh RCU everywhere. In addition this simplifies freeing of objects and allows us to remove some unnecessary callback functions. v3: new patch Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
The bridge multicast code currently uses a custom resizable hashtable which predates the generic rhashtable interface. It has many shortcomings compared and duplicates functionality that is presently available via the generic rhashtable, so this patch removes the custom rhashtable implementation in favor of the kernel's generic rhashtable. The hash maximum is kept and the rhashtable's size is used to do a loose check if it's reached in which case we revert to the old behaviour and disable further bridge multicast processing. Also now we can support any hash maximum, doesn't need to be a power of 2. v3: add non-rcu br_mdb_get variant and use it where multicast_lock is held to avoid RCU splat, drop hash_max function and just set it directly v2: handle when IGMP snooping is undefined, add br_mdb_init/uninit placeholders Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 12月, 2018 5 次提交
-
-
由 Eric Dumazet 提交于
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step to enable bigger tcp sndbuf limits. It works reasonably well, but the following happens : Once the limit is reached, TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches. This patch implements the strategy David Miller added in sock_def_write_space() : - If TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if number of notsent bytes is below X/2 This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full. Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM A:/# cat /proc/sys/net/ipv4/tcp_wmem 4096 262144 64000000 A:/# super_netperf 100 -H B -l 1000 -- -K bbr & A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/ A:/# vmstat 5 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 256220672 13532 694976 0 0 10 0 28 14 0 1 99 0 0 2 0 0 256320016 13532 698480 0 0 512 0 715901 5927 0 10 90 0 0 0 0 0 256197232 13532 700992 0 0 735 13 771161 5849 0 11 89 0 0 1 0 0 256233824 13532 703320 0 0 512 23 719650 6635 0 11 89 0 0 2 0 0 256226880 13532 705780 0 0 642 4 775650 6009 0 12 88 0 0 A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow A:/# vmstat 5 5 # check that context switches have not inflated too much. procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 0 260386512 13592 662148 0 0 10 0 17 14 0 1 99 0 0 0 0 0 260519680 13592 604184 0 0 512 13 726843 12424 0 10 90 0 0 1 1 0 260435424 13592 598360 0 0 512 25 764645 12925 0 10 90 0 0 1 0 0 260855392 13592 578380 0 0 512 7 722943 13624 0 11 88 0 0 1 0 0 260445008 13592 601176 0 0 614 34 772288 14317 0 10 90 0 0 Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Adi Nissim 提交于
It's possible to set a tunnel without a destination port. However, on dump(), a zero dst port is returned to user space even if it was not set, fix that. Note that so far it wasn't required, b/c key less tunnels were not supported and the UDP tunnels do require destination port. Signed-off-by: NAdi Nissim <adin@mellanox.com> Reviewed-by: NOz Shlomo <ozsh@mellanox.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Adi Nissim 提交于
Allow setting a tunnel without a tunnel key. This is required for tunneling protocols, such as GRE, that define the key as an optional field. Signed-off-by: NAdi Nissim <adin@mellanox.com> Acked-by: NOr Gerlitz <ogerlitz@mellanox.com> Reviewed-by: NOz Shlomo <ozsh@mellanox.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ido Schimmel 提交于
Packets marked with 'offload_l3_fwd_mark' were already forwarded by a capable device and should not be forwarded again by the kernel. Therefore, have the kernel consume them. The check is performed in ip{,6}_forward_finish() in order to allow the kernel to process such packets in ip{,6}_forward() and generate required exceptions. For example, ICMP redirects. Signed-off-by: NIdo Schimmel <idosch@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ido Schimmel 提交于
Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added the 'offload_mr_fwd_mark' field to indicate that a packet has already undergone L3 multicast routing by a capable device. The field is used to prevent the kernel from forwarding a packet through a netdev through which the device has already forwarded the packet. Currently, no unicast packet is routed by both the device and the kernel, but this is about to change by subsequent patches and we need to be able to mark such packets, so that they will no be forwarded twice. Instead of adding yet another field to 'struct sk_buff', we can just rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet either has a multicast or a unicast destination IP. While at it, add a comment about both 'offload_fwd_mark' and 'offload_l3_fwd_mark'. Signed-off-by: NIdo Schimmel <idosch@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 12月, 2018 6 次提交
-
-
由 Willem de Bruijn 提交于
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info. Release of its last reference triggers a completion notification. The TCP stack in tcp_sendmsg_locked holds an extra ref independent of the skbs, because it can build, send and free skbs within its loop, possibly reaching refcount zero and freeing the ubuf_info too soon. The UDP stack currently also takes this extra ref, but does not need it as all skbs are sent after return from __ip(6)_append_data. Avoid the extra refcount_inc and refcount_dec_and_test, and generally the sock_zerocopy_put in the common path, by passing the initial reference to the first skb. This approach is taken instead of initializing the refcount to 0, as that would generate error "refcount_t: increment on 0" on the next skb_zcopy_set. Changes v3 -> v4 - Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort - Move the entire skb_shinfo assignment block, to keep that cacheline access in one place Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY. This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweights the cost of generating a completion notification. tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on test patch and making skb_orphan_frags_rx same as skb_orphan_frags: ipv4 udp -t 1 tx=191312 (11938 MB) txc=0 zc=n rx=191312 (11938 MB) ipv4 udp -z -t 1 tx=304507 (19002 MB) txc=304507 zc=y rx=304507 (19002 MB) ok ipv6 udp -t 1 tx=174485 (10888 MB) txc=0 zc=n rx=174485 (10888 MB) ipv6 udp -z -t 1 tx=294801 (18396 MB) txc=294801 zc=y rx=294801 (18396 MB) ok Changes v1 -> v2 - Fixup reverse christmas tree violation v2 -> v3 - Split refcount avoidance optimization into separate patch - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!) - Fix refcount inc on zero - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a853 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp. Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bartosz Golaszewski 提交于
We already have of_get_nvmem_mac_address() but some non-DT systems want to read the MAC address from NVMEM too. Implement a generalized routine that takes struct device as argument. Signed-off-by: NBartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexis Bauvin 提交于
Existing functions to retreive the l3mdev of a device did not walk the master chain to find the upper master. This patch adds a function to find the l3mdev, even indirect through e.g. a bridge: +----------+ | | | vrf-blue | | | +----+-----+ | | +----+-----+ | | | br-blue | | | +----+-----+ | | +----+-----+ | | | eth0 | | | +----------+ This will properly resolve the l3mdev of eth0 to vrf-blue. Signed-off-by: NAlexis Bauvin <abauvin@scaleway.com> Reviewed-by: NAmine Kherbouche <akherbouche@scaleway.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Tested-by: NAmine Kherbouche <akherbouche@scaleway.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexis Bauvin 提交于
UDP tunnel sockets are always opened unbound to a specific device. This patch allow the socket to be bound on a custom device, which incidentally makes UDP tunnels VRF-aware if binding to an l3mdev. Signed-off-by: NAlexis Bauvin <abauvin@scaleway.com> Reviewed-by: NAmine Kherbouche <akherbouche@scaleway.com> Tested-by: NAmine Kherbouche <akherbouche@scaleway.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Shalom Toledo 提交于
Many drivers load the device's firmware image during the initialization flow either from the flash or from the disk. Currently this option is not controlled by the user and the driver decides from where to load the firmware image. 'fw_load_policy' gives the ability to control this option which allows the user to choose between different loading policies supported by the driver. This parameter can be useful while testing and/or debugging the device. For example, testing a firmware bug fix. Signed-off-by: NShalom Toledo <shalomt@mellanox.com> Reviewed-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NIdo Schimmel <idosch@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 12月, 2018 7 次提交
-
-
由 Ido Schimmel 提交于
Currently, the function only works for the bridge device itself, but subsequent patches will need to be able to query the PVID of a given bridge port, so extend the function. Signed-off-by: NIdo Schimmel <idosch@mellanox.com> Reviewed-by: NPetr Machata <petrm@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Standard kernel compilation produces the following warning: net/core/rtnetlink.c: In function ‘rtnl_newlink’: net/core/rtnetlink.c:3232:1: warning: the frame size of 1288 bytes is larger than 1024 bytes [-Wframe-larger-than=] } ^ This should not really be an issue, as rtnl_newlink() stack is generally quite shallow. Fix the warning by allocating attributes with kmalloc() in a wrapper and passing it down to rtnl_newlink(), avoiding complexities on error paths. Alternatively we could kmalloc() some structure within rtnl_newlink(), slave attributes look like a good candidate. In practice it adds to already rather high complexity and length of the function. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
rtnl_newlink() used to create VLAs based on link kind. Since commit ccf8dbcd ("rtnetlink: Remove VLA usage") statically sized array is created on the stack, so there is no more use for a separate code block that used to be the VLA's live range. While at it christmas tree the variables. Note that there is a goto-based retry so to be on the safe side the variables can no longer be initialized in place. It doesn't seem to matter, logically, but why make the code harder to read.. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Most linux hosts never setup TCP MD5 keys. We can avoid a cache line miss (accessing tp->md5ig_info) on RX and TX using a jump label. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
In case GRO is not as efficient as it should be or disabled, we might have a user thread trapped in __release_sock() while softirq handler flood packets up to the point we have to drop. This patch balances work done from user thread and softirq, to give more chances to __release_sock() to complete its work before new packets are added the the backlog. This also helps if we receive many ACK packets, since GRO does not aggregate them. This patch brings ~60% throughput increase on a receiver without GRO, but the spectacular gain is really on 1000x release_sock() latency reduction I have measured. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Neal pointed out that non sack flows might suffer from ACK compression added in the following patch ("tcp: implement coalescing on backlog queue") Instead of tweaking tcp_add_backlog() we can take into account how many ACK were coalesced, this information will be available in skb_shinfo(skb)->gso_segs Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Geneviève Bastien 提交于
Trace events are already present for the receive entry points, to indicate how the reception entered the stack. This patch adds the corresponding exit trace events that will bound the reception such that all events occurring between the entry and the exit can be considered as part of the reception context. This greatly helps for dependency and root cause analyses. Without this, it is not possible with tracepoint instrumentation to determine whether a sched_wakeup event following a netif_receive_skb event is the result of the packet reception or a simple coincidence after further processing by the thread. It is possible using other mechanisms like kretprobes, but considering the "entry" points are already present, it would be good to add the matching exit events. In addition to linking packets with wakeups, the entry/exit event pair can also be used to perform network stack latency analyses. Signed-off-by: NGeneviève Bastien <gbastien@versatic.net> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Ingo Molnar <mingo@redhat.com> CC: David S. Miller <davem@davemloft.net> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side) Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 11月, 2018 2 次提交
-
-
由 Cong Wang 提交于
Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We can remove the loop and conditional branches and compute wscale efficiently thanks to ilog2() Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 11月, 2018 1 次提交
-
-
由 John Fastabend 提交于
This adds a BPF SK_MSG program helper so that we can pop data from a msg. We use this to pop metadata from a previous push data call. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 28 11月, 2018 14 次提交
-
-
由 Taehee Yoo 提交于
There is no expression deactivation call from the rule replacement path, hence, chain counter is not decremented. A few steps to reproduce the problem: %nft add table ip filter %nft add chain ip filter c1 %nft add chain ip filter c1 %nft add rule ip filter c1 jump c2 %nft replace rule ip filter c1 handle 3 accept %nft flush ruleset <jump c2> expression means immediate NFT_JUMP to chain c2. Reference count of chain c2 is increased when the rule is added. When rule is deleted or replaced, the reference counter of c2 should be decreased via nft_rule_expr_deactivate() which calls nft_immediate_deactivate(). Splat looks like: [ 214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Modules linked in: nf_tables nfnetlink [ 214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44 [ 214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8 [ 214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202 [ 214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0 [ 214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78 [ 214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000 [ 214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba [ 214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6 [ 214.398983] FS: 0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000 [ 214.398983] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0 [ 214.398983] Call Trace: [ 214.398983] ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables] [ 214.398983] ? __kasan_slab_free+0x145/0x180 [ 214.398983] ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables] [ 214.398983] ? kfree+0xdb/0x280 [ 214.398983] nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables] [ ... ] Fixes: bb7b40ae ("netfilter: nf_tables: bogus EBUSY in chain deletions") Reported by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505 Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 David Ahern 提交于
Randy reported when CONFIG_PROC_FS is not enabled: ld: net/ipv4/af_inet.o: in function `inet_init': af_inet.c:(.init.text+0x42d): undefined reference to `raw_init' Fix by moving the endif up to the end of the proc entries Fixes: 6897445f ("net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs") Reported-by: NRandy Dunlap <rdunlap@infradead.org> Cc: Mike Manning <mmanning@vyatta.att-mail.com> Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Only one caller needs to pull TCP headers, so lets move __skb_pull() to the caller side. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vijay Khemka 提交于
This patch adds OEM Mellanox commands and response handling. It also defines OEM Get MAC Address handler to get and configure the device. ncsi_oem_gma_handler_mlx: This handler send NCSI mellanox command for getting mac address. ncsi_rsp_handler_oem_mlx: This handles response received for all mellanox OEM commands. ncsi_rsp_handler_oem_mlx_gma: This handles get mac address response and set it to device. Signed-off-by: NVijay Khemka <vijaykhemka@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jon Maloy 提交于
We see the following lockdep warning: [ 2284.078521] ====================================================== [ 2284.078604] WARNING: possible circular locking dependency detected [ 2284.078604] 4.19.0+ #42 Tainted: G E [ 2284.078604] ------------------------------------------------------ [ 2284.078604] rmmod/254 is trying to acquire lock: [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0 [ 2284.078604] [ 2284.078604] but task is already holding lock: [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc] [ 2284.078604] [ 2284.078604] which lock already depends on the new lock. [ 2284.078604] [ 2284.078604] [ 2284.078604] the existing dependency chain (in reverse order) is: [ 2284.078604] [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}: [ 2284.078604] tipc_node_timeout+0x20a/0x330 [tipc] [ 2284.078604] call_timer_fn+0xa1/0x280 [ 2284.078604] run_timer_softirq+0x1f2/0x4d0 [ 2284.078604] __do_softirq+0xfc/0x413 [ 2284.078604] irq_exit+0xb5/0xc0 [ 2284.078604] smp_apic_timer_interrupt+0xac/0x210 [ 2284.078604] apic_timer_interrupt+0xf/0x20 [ 2284.078604] default_idle+0x1c/0x140 [ 2284.078604] do_idle+0x1bc/0x280 [ 2284.078604] cpu_startup_entry+0x19/0x20 [ 2284.078604] start_secondary+0x187/0x1c0 [ 2284.078604] secondary_startup_64+0xa4/0xb0 [ 2284.078604] [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}: [ 2284.078604] del_timer_sync+0x34/0xa0 [ 2284.078604] tipc_node_delete+0x1a/0x40 [tipc] [ 2284.078604] tipc_node_stop+0xcb/0x190 [tipc] [ 2284.078604] tipc_net_stop+0x154/0x170 [tipc] [ 2284.078604] tipc_exit_net+0x16/0x30 [tipc] [ 2284.078604] ops_exit_list.isra.8+0x36/0x70 [ 2284.078604] unregister_pernet_operations+0x87/0xd0 [ 2284.078604] unregister_pernet_subsys+0x1d/0x30 [ 2284.078604] tipc_exit+0x11/0x6f2 [tipc] [ 2284.078604] __x64_sys_delete_module+0x1df/0x240 [ 2284.078604] do_syscall_64+0x66/0x460 [ 2284.078604] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2284.078604] [ 2284.078604] other info that might help us debug this: [ 2284.078604] [ 2284.078604] Possible unsafe locking scenario: [ 2284.078604] [ 2284.078604] CPU0 CPU1 [ 2284.078604] ---- ---- [ 2284.078604] lock(&(&tn->node_list_lock)->rlock); [ 2284.078604] lock((&n->timer)#2); [ 2284.078604] lock(&(&tn->node_list_lock)->rlock); [ 2284.078604] lock((&n->timer)#2); [ 2284.078604] [ 2284.078604] *** DEADLOCK *** [ 2284.078604] [ 2284.078604] 3 locks held by rmmod/254: [ 2284.078604] #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30 [ 2284.078604] #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc] [ 2284.078604] #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19 [...} The reason is that the node timer handler sometimes needs to delete a node which has been disconnected for too long. To do this, it grabs the lock 'node_list_lock', which may at the same time be held by the generic node cleanup function, tipc_node_stop(), during module removal. Since the latter is calling del_timer_sync() inside the same lock, we have a potential deadlock. We fix this letting the timer cleanup function use spin_trylock() instead of just spin_lock(), and when it fails to grab the lock it just returns so that the timer handler can terminate its execution. This is safe to do, since tipc_node_stop() anyway is about to delete both the timer and the node instance. Fixes: 6a939f36 ("tipc: Auto removal of peer down node instance") Acked-by: NYing Xue <ying.xue@windriver.com> Signed-off-by: NJon Maloy <jon.maloy@ericsson.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nicolas Dichtel 提交于
Like the previous patch, the goal is to ease to convert nsids from one netns to another netns. A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids. Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nicolas Dichtel 提交于
Combined with NETNSA_TARGET_NSID, it enables to "translate" a nsid from one netns to a nsid of another netns. This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the user to interpret a nsid received from an other netns. Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nicolas Dichtel 提交于
Like it was done for link and address, add the ability to perform get/dump in another netns by specifying a target nsid attribute. Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nicolas Dichtel 提交于
This is a preparatory work. To avoid having to much arguments for the function rtnl_net_fill(), a new structure is defined. Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nicolas Dichtel 提交于
This argument is not used anymore. Fixes: cab3c8ec ("netns: always provide the id to rtnl_net_fill()") Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xin Long 提交于
I changed to count sk_wmem_alloc by skb truesize instead of 1 to fix the sk_wmem_alloc leak caused by later truesize's change in xfrm in Commit 02968ccf ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit"). But I should have also increased sk_wmem_alloc when head->truesize is increased in sctp_packet_gso_append() as xfrm does. Otherwise, sctp gso packet will cause sk_wmem_alloc underflow. Fixes: 02968ccf ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit") Signed-off-by: NXin Long <lucien.xin@gmail.com> Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Now that we have at least one bool option, we can export all of the supported bool options via optmask when dumping them. v2: new patch Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Use the new boolopt API to add an option which disables learning from link-local packets. The default is kept as before and learning is enabled. This is a simple map from a boolopt bit to a bridge private flag that is tested before learning. v2: pass NULL for extack via sysfs Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
We have been adding many new bridge options, a big number of which are boolean but still take up netlink attribute ids and waste space in the skb. Recently we discussed learning from link-local packets[1] and decided yet another new boolean option will be needed, thus introducing this API to save some bridge nl space. The API supports changing the value of multiple boolean options at once via the br_boolopt_multi struct which has an optmask (which options to set, bit per opt) and optval (options' new values). Future boolean options will only be added to the br_boolopt_id enum and then will have to be handled in br_boolopt_toggle/get. The API will automatically add the ability to change and export them via netlink, sysfs can use the single boolopt function versions to do the same. The behaviour with failing/succeeding is the same as with normal netlink option changing. If an option requires mapping to internal kernel flag or needs special configuration to be enabled then it should be handled in br_boolopt_toggle. It should also be able to retrieve an option's current state via br_boolopt_get. v2: WARN_ON() on unsupported option as that shouldn't be possible and also will help catch people who add new options without handling them for both set and get. Pass down extack so if an option desires it could set it on error and be more user-friendly. [1] https://www.spinics.net/lists/netdev/msg532698.htmlSigned-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 11月, 2018 1 次提交
-
-
由 Taehee Yoo 提交于
All lists that reach the tree_nodes_free() function have both zero counter and true dead flag. The reason for this is that lists to be release are selected by nf_conncount_gc_list() which already decrements the list counter and sets on the dead flag. Therefore, this if statement in tree_nodes_free() is unnecessary and wrong. Fixes: 31568ec0 ("netfilter: nf_conncount: fix list_del corruption in conn_free") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-