- 25 4月, 2017 2 次提交
-
-
由 Wei Wang 提交于
Middlebox firewall issues can potentially cause server's data being blackholed after a successful 3WHS using TFO. Following are the related reports from Apple: https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped. C ---> syn-data ---> S C <--- syn/ack ----- S C (accept & write) C <---- data ------- S C ----- ACK -> X S [retry and timeout] https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Slide 5 shows a similar situation that the server's data gets dropped after 3WHS. C ---- syn-data ---> S C <--- syn/ack ----- S C ---- ack --------> S S (accept & write) C? X <- data ------ S [retry and timeout] This is the worst failure b/c the client can not detect such behavior to mitigate the situation (such as disabling TFO). Failing to proceed, the application (e.g., SSL library) may simply timeout and retry with TFO again, and the process repeats indefinitely. The proposed solution is to disable active TFO globally under the following circumstances: 1. client side TFO socket detects out of order FIN 2. client side TFO socket receives out of order RST We disable active side TFO globally for 1hr at first. Then if it happens again, we disable it for 2h, then 4h, 8h, ... And we reset the timeout to 1hr if a client side TFO sockets not opened on loopback has successfully received data segs from server. And we examine this condition during close(). The rational behind it is that when such firewall issue happens, application running on the client should eventually close the socket as it is not able to get the data it is expecting. Or application running on the server should close the socket as it is not able to receive any response from client. In both cases, out of order FIN or RST will get received on the client given that the firewall will not block them as no data are in those frames. And we want to disable active TFO globally as it helps if the middle box is very close to the client and most of the connections are likely to fail. Also, add a debug sysctl: tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when firewall blackhole issue happens. This can be set and read. When setting it to 0, it means to disable the active disable logic. Signed-off-by: NWei Wang <weiwan@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
systemd-sysctl is triggering a suspicious RCU usage message when net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via a sysctl config file: [ 33.896184] =============================== [ 33.899558] [ ERR: suspicious RCU usage. ] [ 33.900624] 4.11.0-rc7+ #104 Not tainted [ 33.901698] ------------------------------- [ 33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage! [ 33.905724] other info that might help us debug this: [ 33.907656] rcu_scheduler_active = 2, debug_locks = 0 [ 33.909288] 1 lock held by systemd-sysctl/143: [ 33.910373] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48 [ 33.912407] stack backtrace: [ 33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104 [ 33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 33.917870] Call Trace: [ 33.918431] dump_stack+0x81/0xb6 [ 33.919241] lockdep_rcu_suspicious+0x10f/0x118 [ 33.920263] proc_configure_early_demux+0x65/0x10a [ 33.921391] proc_udp_early_demux+0x3a/0x41 add rcu locking to proc_configure_early_demux. Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp") Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 4月, 2017 1 次提交
-
-
由 Craig Gallek 提交于
This feature allows the administrator to set an fwmark for packets traversing a tunnel. This allows the use of independent routing tables for tunneled packets without the use of iptables. There is no concept of per-packet routing decisions through IPv4 tunnels, so this implementation does not need to work with per-packet route lookups as the v6 implementation may (with IP6_TNL_F_USE_ORIG_FWMARK). Further, since the v4 tunnel ioctls share datastructures (which can not be trivially modified) with the kernel's internal tunnel configuration structures, the mark attribute must be stored in the tunnel structure itself and passed as a parameter when creating or changing tunnel attributes. Signed-off-by: NCraig Gallek <kraig@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 4月, 2017 3 次提交
-
-
由 Chema Gonzalez 提交于
Signed-off-by: NChema Gonzalez <chemag@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When using TCP FastOpen for an active session, we send one wakeup event from tcp_finish_connect(), right before the data eventually contained in the received SYNACK is queued to sk->sk_receive_queue. This means that depending on machine load or luck, poll() users might receive POLLOUT events instead of POLLIN|POLLOUT To fix this, we need to move the call to sk->sk_state_change() after the (optional) call to tcp_rcv_fastopen_synack() Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When a RST packet is processed, we send two wakeup events to interested polling users. First one by a sk->sk_error_report(sk) from tcp_reset(), followed by a sk->sk_state_change(sk) from tcp_done(). Depending on machine load and luck, poll() can either return POLLERR, or POLLIN|POLLOUT|POLLERR|POLLHUP (this happens on 99 % of the cases) This is probably fine, but we can avoid the confusion by reordering things so that we have more TCP fields updated before the first wakeup. This might even allow us to remove some barriers we added in the past. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 4月, 2017 1 次提交
-
-
由 Ilan Tayari 提交于
If esp*_offload module is loaded, outbound packets take the GSO code path, being encapsulated at layer 3, but encrypted in layer 2. validate_xmit_xfrm calls esp*_xmit for that. esp*_xmit was wrongfully detecting these packets as going through hardware crypto offload, while in fact they should be encrypted in software, causing plaintext leakage to the network, and also dropping at the receiver side. Perform the encryption in esp*_xmit, if the SA doesn't have a hardware offload_handle. Also, align esp6 code to esp4 logic. Fixes: fca11ebd ("esp4: Reorganize esp_output") Fixes: 383d0350 ("esp6: Reorganize esp_output") Signed-off-by: NIlan Tayari <ilant@mellanox.com> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
- 18 4月, 2017 3 次提交
-
-
由 David Ahern 提交于
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
Syzkaller reported a use-after-free in ip_recv_error at line info->ipi_ifindex = skb->dev->ifindex; This function is called on dequeue from the error queue, at which point the device pointer may no longer be valid. Save ifindex on enqueue in __skb_complete_tx_timestamp, when the pointer is valid or NULL. Store it in temporary storage skb->cb. It is safe to reference skb->dev here, as called from device drivers or dev_queue_xmit. The exception is when called from tcp_ack_tstamp; in that case it is NULL and ifindex is set to 0 (invalid). Do not return a pktinfo cmsg if ifindex is 0. This maintains the current behavior of not returning a cmsg if skb->dev was NULL. On dequeue, the ipv4 path will cast from sock_exterr_skb to in_pktinfo. Both have ifindex as their first element, so no explicit conversion is needed. This is by design, introduced in commit 0b922b7a ("net: original ingress device index in PKTINFO"). For ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo. Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp") Reported-by: NAndrey Konovalov <andreyknvl@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
Similar to commit 87e9f031 ("ipv4: fix a potential deadlock in mcast getsockopt() path"), there is a deadlock scenario for IP_ROUTER_ALERT too: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock(sk_lock-AF_INET); lock(rtnl_mutex); lock(sk_lock-AF_INET); Fix this by always locking RTNL first on all setsockopt() paths. Note, after this patch ip_ra_lock is no longer needed either. Reported-by: NDmitry Vyukov <dvyukov@google.com> Tested-by: NAndrey Konovalov <andreyknvl@google.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 4月, 2017 10 次提交
-
-
由 Steffen Klassert 提交于
On IPsec hardware offloading, we already get a secpath with valid state attached when the packet enters the GRO handlers. So check for hardware offload and skip the state lookup in this case. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Ilan Tayari 提交于
Both esp4 and esp6 used to assume that the SKB payload is encrypted and therefore the inner_network and inner_transport offsets are not relevant. When doing crypto offload in the NIC, this is no longer the case and the NIC driver needs these offsets so it can do TX TCP checksum offloading. This patch sets the inner_network and inner_transport members of the SKB, as well as encapsulation, to reflect the actual positions of these headers, and removes them only once encryption is done on the payload. Signed-off-by: NIlan Tayari <ilant@mellanox.com> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Steffen Klassert 提交于
We need a fallback algorithm for crypto offloading to a NIC. This is because packets can be rerouted to other NICs that don't support crypto offloading. The fallback is going to be implemented at layer2 where we know the final output device but can't handle asynchronous returns fron the crypto layer. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Steffen Klassert 提交于
This patch extends the xfrm_type by an encap function pointer and implements esp4_gso_encap and esp6_gso_encap. These functions doing the basic esp encapsulation for a GSO packet. In case the GSO packet needs to be segmented in software, we add gso_segment functions. This codepath is going to be used on esp hardware offloads. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Steffen Klassert 提交于
We need a fallback for ESP at layer 2, so split esp_output into generic functions that can be used at layer 3 and layer 2 and use them in esp_output. We also add esp_xmit which is used for the layer 2 fallback. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Steffen Klassert 提交于
This patch adds all the bits that are needed to do IPsec hardware offload for IPsec states and ESP packets. We add xfrmdev_ops to the net_device. xfrmdev_ops has function pointers that are needed to manage the xfrm states in the hardware and to do a per packet offloading decision. Joint work with: Ilan Tayari <ilant@mellanox.com> Guy Shapiro <guysh@mellanox.com> Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: NGuy Shapiro <guysh@mellanox.com> Signed-off-by: NIlan Tayari <ilant@mellanox.com> Signed-off-by: NYossi Kuperman <yossiku@mellanox.com> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Steffen Klassert 提交于
This patch adds a gso_segment and xmit callback for the xfrm_mode and implement these functions for tunnel and transport mode. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Gao Feng 提交于
Current codes invoke wrongly nf_ct_netns_get in the destroy routine, it should use nf_ct_netns_put, not nf_ct_netns_get. It could cause some modules could not be unloaded. Fixes: ecb2421b ("netfilter: add and use nf_ct_netns_get/put") Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Johannes Berg 提交于
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Gao Feng 提交于
1. Don't get the metric RTAX_ADVMSS of dst. There are two reasons. 1) Its caller dst_metric_advmss has already invoke dst_metric_advmss before invoke default_advmss. 2) The ipv4_default_advmss is used to get the default mss, it should not try to get the metric like ip6_default_advmss. 2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40. 3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to RFC 2675, section 5.1. Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 4月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
In the (very unlikely) case a passive socket becomes a listener, we do not want to duplicate its saved SYN headers. This would lead to double frees, use after free, and please hackers and various fuzzers Tested: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 5) = 0 +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.1 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 connect(4, AF_UNSPEC, ...) = 0 +0 close(3) = 0 +0 bind(4, ..., ...) = 0 +0 listen(4, 5) = 0 +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.1 < . 1:1(0) ack 1 win 257 Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2017 2 次提交
-
-
由 Gao Feng 提交于
Because TCP_MIB_OUTRSTS is an important count, so always increase it whatever send it successfully or not. Now move the increment of TCP_MIB_OUTRSTS to the top of tcp_send_active_reset to make sure it is increased always even though fail to alloc skb. Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
The recent extension of F-RTO 89fe18e4 ("tcp: extend F-RTO to catch more spurious timeouts") interacts badly with certain broken middle-boxes. These broken boxes modify and falsely raise the receive window on the ACKs. During a timeout induced recovery, F-RTO would send new data packets to probe if the timeout is false or not. Since the receive window is falsely raised, the receiver would silently drop these F-RTO packets. The recovery would take N (exponentially backoff) timeouts to repair N packet losses. A TCP performance killer. Due to this unfortunate situation, this patch removes this extension to revert F-RTO back to the RFC specification. Fixes: 89fe18e4 ("tcp: extend F-RTO to catch more spurious timeouts") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 4月, 2017 2 次提交
-
-
由 Florian Larysch 提交于
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, fib_multipath_hash ends up being called with that skb. However, as that skb contains no information beyond the protocol type, the calculated hash does not match the one we would see for a real packet. There is currently no way to fix this for layer 4 hashing, as RTM_GETROUTE doesn't have the necessary information to create layer 4 headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs in the skb and also change the protocol to UDP to avoid special treatment for ICMP. Signed-off-by: NFlorian Larysch <fl@n621.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Larysch 提交于
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, ip_multipath_icmp_hash ends up being called, which uses the source/destination addresses within the skb to calculate a hash. However, those are not set in the synthetic skb, causing it to return an arbitrary and incorrect result. Instead, use UDP, which gets no such special treatment. Signed-off-by: NFlorian Larysch <fl@n621.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 4月, 2017 2 次提交
-
-
由 Yuchung Cheng 提交于
Currently the reordering SNMP counters only increase if a connection sees a higher degree then it has previously seen. It ignores if the reordering degree is not greater than the default system threshold. This significantly under-counts the number of reordering events and falsely convey that reordering is rare on the network. This patch properly and faithfully records the number of reordering events detected by the TCP stack, just like the comment says "this exciting event is worth to be remembered". Note that even so TCP still under-estimate the actual reordering events because TCP requires TS options or certain packet sequences to detect reordering (i.e. ACKing never-retransmitted sequence in recovery or disordered state). Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
The lost retransmit SNMP stat is under-counting retransmission that uses segment offloading. This patch fixes that so all retransmission related SNMP counters are consistent. Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 4月, 2017 1 次提交
-
-
由 Gao Feng 提交于
Define one new macro TCP_MAX_WSCALE instead of literal number '14', and use U16_MAX instead of 65535 as the max value of TCP window. There is another minor change, use rounddown(space, mss) instead of (space / mss) * mss; Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2017 1 次提交
-
-
由 Marcelo Ricardo Leitner 提交于
Markus Trippelsdorf reported that after commit dcb17d22 ("tcp: warn on bogus MSS and try to amend it") the kernel started logging the warning for a NIC driver that doesn't even support GRO. It was diagnosed that it was possibly caused on connections that were using TCP Timestamps but some packets lacked the Timestamps option. As we reduce rcv_mss when timestamps are used, the lack of them would cause the packets to be bigger than expected, although this is a valid case. As this warning is more as a hint, getting a clean-cut on the threshold is probably not worth the execution time spent on it. This patch thus alleviates the false-positives with 2 quick checks: by accounting for the entire TCP option space and also checking against the interface MTU if it's available. These changes, specially the MTU one, might mask some real positives, though if they are really happening, it's possible that sooner or later it will be triggered anyway. Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 3月, 2017 1 次提交
-
-
由 Gao Feng 提交于
1. Move the "window = tp->rcv_wnd;" into the condition block without tp->rx_opt.rcv_wscale. Because it is unnecessary when enable wscale; 2. Use the macro ALIGN instead of two statements. The two statements are used to make window align to 1<<wscale. Use the ALIGN is more clearer. 3. Use the rounddown to make codes clearer. Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 3月, 2017 3 次提交
-
-
由 Andrew Lunn 提交于
There is an include loop between netdevice.h, dsa.h, devlink.h because of NETDEV_ALIGN, making it impossible to use devlink structures in dsa.h. Break this loop by taking dsa.h out of netdevice.h, add a forward declaration of dsa_switch_tree and netdev_set_default_ethtool_ops() function, which is what netdevice.h requires. No longer having dsa.h in netdevice.h means the includes in dsa.h no longer get included. This breaks a few other files which depend on these includes. Add these directly in the affected file. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Send RTM_DELNETCONF notifications when a device is deleted. The message only needs the device index, so modify inet_netconf_fill_devconf to skip devconf references if it is NULL. Allows a userspace cache to remove entries as devices are deleted. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Refactor inet_netconf_notify_devconf to take the event as an input arg. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 3月, 2017 1 次提交
-
-
由 Mark Rutland 提交于
Our chosen ic_dev may be anywhere in our list of ic_devs, and we may free it before attempting to close others. When we compare d->dev and ic_dev->dev, we're potentially dereferencing memory returned to the allocator. This causes KASAN to scream for each subsequent ic_dev we check. As there's a 1-1 mapping between ic_devs and netdevs, we can instead compare d and ic_dev directly, which implicitly handles the !ic_dev case, and avoids the use-after-free. The ic_dev pointer may be stale, but we will not dereference it. Original splat: [ 6.487446] ================================================================== [ 6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr ffff800367efa708 [ 6.503013] Read of size 8 by task swapper/0/1 [ 6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-00002-gda42158 #8 [ 6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 3.05.05-beta_rc Jan 27 2016 [ 6.523138] Call trace: [ 6.525590] [<ffff200008094778>] dump_backtrace+0x0/0x570 [ 6.530976] [<ffff200008094d08>] show_stack+0x20/0x30 [ 6.536017] [<ffff200008bee928>] dump_stack+0x120/0x188 [ 6.541231] [<ffff20000856d5e4>] kasan_object_err+0x24/0xa0 [ 6.546790] [<ffff20000856d924>] kasan_report_error+0x244/0x738 [ 6.552695] [<ffff20000856dfec>] __asan_report_load8_noabort+0x54/0x80 [ 6.559204] [<ffff20000aae86ac>] ic_close_devs+0xc4/0x154 [ 6.564590] [<ffff20000aaedbac>] ip_auto_config+0x2ed4/0x2f1c [ 6.570321] [<ffff200008084b04>] do_one_initcall+0xcc/0x370 [ 6.575882] [<ffff20000aa31de8>] kernel_init_freeable+0x5f8/0x6c4 [ 6.581959] [<ffff20000a16df00>] kernel_init+0x18/0x190 [ 6.587171] [<ffff200008084710>] ret_from_fork+0x10/0x40 [ 6.592468] Object at ffff800367efa700, in cache kmalloc-128 size: 128 [ 6.598969] Allocated: [ 6.601324] PID = 1 [ 6.603427] save_stack_trace_tsk+0x0/0x418 [ 6.607603] save_stack_trace+0x20/0x30 [ 6.611430] kasan_kmalloc+0xd8/0x188 [ 6.615087] ip_auto_config+0x8c4/0x2f1c [ 6.619002] do_one_initcall+0xcc/0x370 [ 6.622832] kernel_init_freeable+0x5f8/0x6c4 [ 6.627178] kernel_init+0x18/0x190 [ 6.630660] ret_from_fork+0x10/0x40 [ 6.634223] Freed: [ 6.636233] PID = 1 [ 6.638334] save_stack_trace_tsk+0x0/0x418 [ 6.642510] save_stack_trace+0x20/0x30 [ 6.646337] kasan_slab_free+0x88/0x178 [ 6.650167] kfree+0xb8/0x478 [ 6.653131] ic_close_devs+0x130/0x154 [ 6.656875] ip_auto_config+0x2ed4/0x2f1c [ 6.660875] do_one_initcall+0xcc/0x370 [ 6.664705] kernel_init_freeable+0x5f8/0x6c4 [ 6.669051] kernel_init+0x18/0x190 [ 6.672534] ret_from_fork+0x10/0x40 [ 6.676098] Memory state around the buggy address: [ 6.680880] ffff800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 6.688078] ffff800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 6.695276] >ffff800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 6.702469] ^ [ 6.705952] ffff800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 6.713149] ffff800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 6.720343] ================================================================== [ 6.727536] Disabling lock debugging due to kernel taint Signed-off-by: NMark Rutland <mark.rutland@arm.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: David S. Miller <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 3月, 2017 2 次提交
-
-
由 Gao Feng 提交于
In the commit 93557f53 ("netfilter: nf_conntrack: nf_conntrack snmp helper"), the snmp_helper is replaced by nf_nat_snmp_hook. So the snmp_helper is never registered. But it still tries to unregister the snmp_helper, it could cause the panic. Now remove the useless snmp_helper and the unregister call in the error handler. Fixes: 93557f53 ("netfilter: nf_conntrack: nf_conntrack snmp helper") Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Liping Zhang 提交于
Otherwise, another CPU may access the invalid pointer. For example: CPU0 CPU1 - rcu_read_lock(); - pfunc = _hook_; _hook_ = NULL; - mod unload - - pfunc(); // invalid, panic - rcu_read_unlock(); So we must call synchronize_rcu() to wait the rcu reader to finish. Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked by later nf_conntrack_helper_unregister, but I'm inclined to add a explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend on such obscure assumptions is not a good idea. Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object, so in cttimeout_exit, invoking rcu_barrier() is not necessary at all, remove it too. Signed-off-by: NLiping Zhang <zlpnobody@gmail.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 25 3月, 2017 3 次提交
-
-
由 Eric Dumazet 提交于
We got a report of yet another bug in ping http://www.openwall.com/lists/oss-security/2017/03/24/6 ->disconnect() is not called with socket lock held. Fix this by acquiring ping rwlock earlier. Thanks to Daniel, Alexander and Andrey for letting us know this problem. Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDaniel Jiang <danieljiang0415@gmail.com> Reported-by: NSolar Designer <solar@openwall.com> Reported-by: NAndrey Konovalov <andreyknvl@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
While working on some recent busy poll changes we found that child sockets were being instantiated without NAPI ID being set. In our first attempt to fix it, it was suggested that we should just pull programming the NAPI ID into the function itself since all callers will need to have it set. In addition to the NAPI ID change I have dropped the code that was populating the Rx hash since it was actually being populated in tcp_get_cookie_sock. Reported-by: NSridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 subashab@codeaurora.org 提交于
Certain system process significant unconnected UDP workload. It would be preferrable to disable UDP early demux for those systems and enable it for TCP only. By disabling UDP demux, we see these slight gains on an ARM64 system- 782 -> 788Mbps unconnected single stream UDPv4 633 -> 654Mbps unconnected UDPv4 different sources The performance impact can change based on CPU architecure and cache sizes. There will not much difference seen if entire UDP hash table is in cache. Both sysctls are enabled by default to preserve existing behavior. v1->v2: Change function pointer instead of adding conditional as suggested by Stephen. v2->v3: Read once in callers to avoid issues due to compiler optimizations. Also update commit message with the tests. v3->v4: Store and use read once result instead of querying pointer again incorrectly. v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n} Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org> Suggested-by: NEric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Tom Herbert <tom@herbertland.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 3月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
Dmitry reported a lockdep splat [1] (false positive) that we can fix by releasing the spinlock before calling icmp_send() from ip_expire() This is a false positive because sending an ICMP message can not possibly re-enter the IP frag engine. [1] [ INFO: possible circular locking dependency detected ] 4.10.0+ #29 Not tainted ------------------------------------------------------- modprobe/12392 is trying to acquire lock: (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock include/linux/spinlock.h:299 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock include/linux/netdevice.h:3486 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 but task is already holding lock: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&q->lock)->rlock){+.-...}: validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669 ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713 packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459 deliver_skb net/core/dev.c:1834 [inline] dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890 xmit_one net/core/dev.c:2903 [inline] dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923 sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308 neigh_output include/net/neighbour.h:478 [inline] ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228 ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672 ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545 ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985 __sys_sendmmsg+0x25c/0x750 net/socket.c:2075 SYSC_sendmmsg net/socket.c:2106 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2101 do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281 return_from_SYSCALL_64+0x0/0x7a -> #0 (_xmit_ETHER#2){+.-...}: check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 __read_once_size include/linux/compiler.h:254 [inline] atomic_read arch/x86/include/asm/atomic.h:26 [inline] rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] __rcu_is_watching kernel/rcu/tree.c:1133 [inline] rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); *** DEADLOCK *** 10 locks held by modprobe/12392: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>] __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336 #1: (rcu_read_lock){......}, at: [<ffffffff8188cab6>] filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324 #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] spin_lock include/linux/spinlock.h:299 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] pte_alloc_one_map mm/memory.c:2944 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072 #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] lockdep_copy_map include/linux/lockdep.h:175 [inline] #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258 #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 #5: (rcu_read_lock){......}, at: [<ffffffff8389a633>] ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216 #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock include/linux/spinlock.h:309 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock net/ipv4/icmp.c:219 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681 #7: (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>] ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198 #8: (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>] __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324 #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: [<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 stack backtrace: CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:52 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204 check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline] RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline] RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline] RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10 RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000 R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25 R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000 </IRQ> rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 RIP: 0033:0x7f83172f2786 RSP: 002b:00007fffe859ae80 EFLAGS: 00010293 RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970 RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000 R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040 R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-