- 14 11月, 2009 2 次提交
-
-
由 Eric Dumazet 提交于
While investigating for network latencies, I found inet_getid() was a contention point for some workloads, as inet_peer_idlock is shared by all inet_getid() users regardless of peers. One way to fix this is to make ip_id_count an atomic_t instead of __u16, and use atomic_add_return(). In order to keep sizeof(struct inet_peer) = 64 on 64bit arches tcp_ts_stamp is also converted to __u32 instead of "unsigned long". Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 William Allen Simpson 提交于
Define two symbols needed in both kernel and user space. Remove old (somewhat incorrect) kernel variant that wasn't used in most cases. Default should apply to both RMSS and SMSS (RFC2581). Replace numeric constants with defined symbols. Stand-alone patch, originally developed for TCPCT. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 10月, 2009 1 次提交
-
-
由 Gilad Ben-Yossef 提交于
We need tcp_parse_options to be aware of dst_entry to take into account per dst_entry TCP options settings Signed-off-by: NGilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: NOri Finkelman <ori@comsleep.com> Sigend-off-by: NYony Amit <yony@comsleep.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 10月, 2009 1 次提交
-
-
由 Eric Dumazet 提交于
In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 10月, 2009 1 次提交
-
-
由 Eric Dumazet 提交于
Storing the mask (size - 1) instead of the size allows fast path to be a bit faster. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 9月, 2009 1 次提交
-
-
由 Ilpo Järvinen 提交于
It was once upon time so that snd_sthresh was a 16-bit quantity. ...That has not been true for long period of time. I run across some ancient compares which still seem to trust such legacy. Put all that magic into a single place, I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 9月, 2009 1 次提交
-
-
由 Wu Fengguang 提交于
This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 9月, 2009 2 次提交
-
-
由 Stephen Hemminger 提交于
The function block inet_connect_sock_af_ops contains no data make it constant. Signed-off-by: NStephen Hemminger <shemminger@vyatta.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stephen Hemminger 提交于
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 9月, 2009 2 次提交
-
-
由 Damian Lukowski 提交于
Here, an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, is taken as an indication that the RTO retransmission has not been lost due to congestion, but because of a route failure somewhere along the path. With true congestion, a router won't trigger such a message and the patched TCP will operate as standard TCP. This patch reverts one RTO backoff, if an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, arrives. Based on the new RTO, the retransmission timer is reset to reflect the remaining time, or - if the revert clocked out the timer - a retransmission is sent out immediately. Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if there have been retransmissions and reversible backoffs, already. Changes from v2: 1) Renaming of skb in tcp_v4_err() moved to another patch. 2) Reintroduced tcp_bound_rto() and __tcp_set_rto(). 3) Fixed code comments. Signed-off-by: NDamian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Damian Lukowski 提交于
This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to disambiguate from another sk_buff variable, which will be introduced in a separate patch. Signed-off-by: NDamian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 7月, 2009 2 次提交
-
-
由 John Dykstra 提交于
When the TCP connection handshake completes on the passive side, a variety of state must be set up in the "child" sock, including the key if MD5 authentication is being used. Fix TCP for both address families to label the key with the peer's destination address, rather than the address from the listening sock, which is usually the wildcard. Reported-by: NStephen Hemminger <shemminger@vyatta.com> Signed-off-by: NJohn Dykstra <john.dykstra1@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Dykstra 提交于
Fix MD5 signature checking so that an IPv4 active open to an IPv6 socket can succeed. In particular, use the correct address family's signature generation function for the SYN/ACK. Reported-by: NStephen Hemminger <shemminger@vyatta.com> Signed-off-by: NJohn Dykstra <john.dykstra1@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 6月, 2009 2 次提交
-
-
由 Eric Dumazet 提交于
Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2009 1 次提交
-
-
由 Shan Wei 提交于
Signed-off-by: Shan Wei<shanwei@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 4月, 2009 1 次提交
-
-
由 Herbert Xu 提交于
On a brand new GRO skb, we cannot call ip_hdr since the header may lie in the non-linear area. This patch adds the helper skb_gro_network_header to handle this. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 3月, 2009 1 次提交
-
-
由 Paul Moore 提交于
The current placement of the security_inet_conn_request() hooks do not allow individual LSMs to override the IP options of the connection's request_sock. This is a problem as both SELinux and Smack have the ability to use labeled networking protocols which make use of IP options to carry security attributes and the inability to set the IP options at the start of the TCP handshake is problematic. This patch moves the IPv4 security_inet_conn_request() hooks past the code where the request_sock's IP options are set/reset so that the LSM can safely manipulate the IP options as needed. This patch intentionally does not change the related IPv6 hooks as IPv6 based labeling protocols which use IPv6 options are not currently implemented, once they are we will have a better idea of the correct placement for the IPv6 hooks. Signed-off-by: NPaul Moore <paul.moore@hp.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJames Morris <jmorris@namei.org>
-
- 12 3月, 2009 1 次提交
-
-
由 Eric Dumazet 提交于
Some systems send SYN packets with apparently wrong RFC1323 timestamp option values [timestamp tsval=0 tsecr=0]. It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml ) Linux TCP stack ignores this option and sends back a SYN+ACK packet without timestamp option, thus many TCP flows cannot use timestamps and lose some benefit of RFC1323. Other operating systems seem to not care about initial tsval value, and let tcp flows to negotiate timestamp option. Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 3月, 2009 1 次提交
-
-
由 Eric W. Biederman 提交于
To remove the possibility of packets flying around when network devices are being cleaned up use reisger_pernet_subsys instead of register_pernet_device. Signed-off-by: NEric W. Biederman <ebiederm@aristanetworks.com> Acked-by: NDenis V. Lunev <den@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 2月, 2009 1 次提交
-
-
由 Eric W. Biederman 提交于
To remove the possibility of packets flying around when network devices are being cleaned up use reisger_pernet_subsys instead of register_pernet_device. Signed-off-by: NEric W. Biederman <ebiederm@aristanetworks.com> Acked-by: NDenis V. Lunev <den@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 1月, 2009 1 次提交
-
-
由 Herbert Xu 提交于
Unfortunately simplicity isn't always the best. The fraginfo interface turned out to be suboptimal. The problem was quite obvious. For every packet, we have to copy the headers from the frags structure into skb->head, even though for 99% of the packets this part is immediately thrown away after the merge. LRO didn't have this problem because it directly read the headers from the frags structure. This patch attempts to address this by creating an interface that allows GRO to access the headers in the first frag without having to copy it. Because all drivers that use frags place the headers in the first frag this optimisation should be enough. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 1月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Use the general-purpose channel allocation provided by dmaengine. Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 30 12月, 2008 1 次提交
-
-
由 Herbert Xu 提交于
When we converted the protocol atomic counters such as the orphan count and the total socket count deadlocks were introduced due to the mismatch in BH status of the spots that used the percpu counter operations. Based on the diagnosis and patch by Peter Zijlstra, this patch fixes these issues by disabling BH where we may be in process context. Reported-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 12月, 2008 1 次提交
-
-
由 Herbert Xu 提交于
This patch adds the TCP-specific portion of GRO. The criterion for merging is extremely strict (the TCP header must match exactly apart from the checksum) so as to allow refragmentation. Otherwise this is pretty much identical to LRO, except that we support the merging of ECN packets. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 11月, 2008 1 次提交
-
-
由 Eric Dumazet 提交于
Instead of using one atomic_t per protocol, use a percpu_counter for "sockets_allocated", to reduce cache line contention on heavy duty network servers. Note : We revert commit (248969ae net: af_unix can make unix_nr_socks visbile in /proc), since it is not anymore used after sock_prot_inuse_add() addition Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 11月, 2008 1 次提交
-
-
由 Eric Dumazet 提交于
This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table. The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now flight between two different tables (established and listening) during a RCU grace period, so we must use different 'nulls' end-of-chain values for two tables. We define a large value : #define LISTENING_NULLS_BASE (1U << 29) So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain. Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 11月, 2008 1 次提交
-
-
由 Eric Dumazet 提交于
Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks. /proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'. This should speedup writers, since spin_lock()/spin_unlock() only use one atomic operation instead of two for write_lock()/write_unlock() Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2008 2 次提交
-
-
由 Eric Dumazet 提交于
This patch prepares RCU migration of listening_hash table for TCP/DCCP protocols. listening_hash table being small (32 slots per protocol), we add a spinlock for each slot, instead of a single rwlock for whole table. This should reduce hold time of readers, and writers concurrency. Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
The first argument to csum_partial is const void * casts to char/u8 * are not necessary Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 11月, 2008 1 次提交
-
-
由 Eric Dumazet 提交于
RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: NEric Dumazet <dada1@cosmosbay.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 11月, 2008 1 次提交
-
-
由 Jianjun Kong 提交于
Signed-off-by: NJianjun Kong <jianjun@zeuux.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 10月, 2008 1 次提交
-
-
由 Harvey Harrison 提交于
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 10月, 2008 1 次提交
-
-
由 Ilpo Järvinen 提交于
Maybe it's just me but I guess those md5 people made a mess out of it by having *_md5_hash_* to use daddr, saddr order instead of the one that is natural (and equal to what csum functions use). For the segment were sending, the original addresses are reversed so buff's saddr == skb's daddr and vice-versa. Maybe I can finally proceed with unification of some code after fixing it first... :-) Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 10月, 2008 1 次提交
-
-
由 Ilpo Järvinen 提交于
While looking for some common code I came across difference in checksum calculation between tcp_v6_send_(reset|ack) I couldn't explain. I checked both v4 and v6 and found out that both seem to have the same "feature". I couldn't find anything in rfc nor anywhere else which would state that md5 option should be ignored like it was in case of reset so I came to a conclusion that this is probably a genuine bug. I suspect that addition of md5 just was fooled by the excessive copy-paste code in those functions and the reset part was never tested well enough to find out the problem. Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 10月, 2008 1 次提交
-
-
由 Arnaldo Carvalho de Melo 提交于
To be able to use the cached socket reference in the skb during input processing we add a new set of lookup functions that receive the skb on their argument list. Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NKOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 10月, 2008 2 次提交
-
-
由 KOVACS Krisztian 提交于
The TCP stack sends out SYN+ACK/ACK/RST reply packets in response to incoming packets. The non-local source address check on output bites us again, as replies for transparently redirected traffic won't have a chance to leave the node. This patch selectively sets the FLOWI_FLAG_ANYSRC flag when doing the route lookup for those replies. Transparent replies are enabled if the listening socket has the transparent socket flag set. Signed-off-by: NKOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vitaliy Gusev 提交于
Fix NULL dereference in tcp_4_send_ack(). As skb->dev is reset to NULL in tcp_v4_rcv() thus OOPS occurs: BUG: unable to handle kernel NULL pointer dereference at 00000000000004d0 IP: [<ffffffff80498503>] tcp_v4_send_ack+0x203/0x250 Stack: ffff810005dbb000 ffff810015c8acc0 e77b2c6e5f861600 a01610802e90cb6d 0a08010100000000 88afffff88afffff 0000000080762be8 0000000115c872e8 0004122000000000 0000000000000001 ffffffff80762b88 0000000000000020 Call Trace: <IRQ> [<ffffffff80499c33>] tcp_v4_reqsk_send_ack+0x20/0x22 [<ffffffff8049bce5>] tcp_check_req+0x108/0x14c [<ffffffff8047aaf7>] ? rt_intern_hash+0x322/0x33c [<ffffffff80499846>] tcp_v4_do_rcv+0x399/0x4ec [<ffffffff8045ce4b>] ? skb_checksum+0x4f/0x272 [<ffffffff80485b74>] ? __inet_lookup_listener+0x14a/0x15c [<ffffffff8049babc>] tcp_v4_rcv+0x6a1/0x701 [<ffffffff8047e739>] ip_local_deliver_finish+0x157/0x24a [<ffffffff8047ec9a>] ip_local_deliver+0x72/0x7c [<ffffffff8047e5bd>] ip_rcv_finish+0x38d/0x3b2 [<ffffffff803d3548>] ? scsi_io_completion+0x19d/0x39e [<ffffffff8047ebe5>] ip_rcv+0x2a2/0x2e5 [<ffffffff80462faa>] netif_receive_skb+0x293/0x303 [<ffffffff80465a9b>] process_backlog+0x80/0xd0 [<ffffffff802630b4>] ? __rcu_process_callbacks+0x125/0x1b4 [<ffffffff8046560e>] net_rx_action+0xb9/0x17f [<ffffffff80234cc5>] __do_softirq+0xa3/0x164 [<ffffffff8020c52c>] call_softirq+0x1c/0x28 <EOI> [<ffffffff8020de1c>] do_softirq+0x34/0x72 [<ffffffff80234b8e>] local_bh_enable_ip+0x3f/0x50 [<ffffffff804d43ca>] _spin_unlock_bh+0x12/0x14 [<ffffffff804599cd>] release_sock+0xb8/0xc1 [<ffffffff804a6f9a>] inet_stream_connect+0x146/0x25c [<ffffffff80243078>] ? autoremove_wake_function+0x0/0x38 [<ffffffff8045751f>] sys_connect+0x68/0x8e [<ffffffff80291818>] ? fd_install+0x5f/0x68 [<ffffffff80457784>] ? sock_map_fd+0x55/0x62 [<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80 Code: 41 10 11 d0 83 d0 00 4d 85 ed 89 45 c0 c7 45 c4 08 00 00 00 74 07 41 8b 45 04 89 45 c8 48 8b 43 20 8b 4d b8 48 8d 55 b0 48 89 de <48> 8b 80 d0 04 00 00 48 8b b8 60 01 00 00 e8 20 ae fe ff 65 48 RIP [<ffffffff80498503>] tcp_v4_send_ack+0x203/0x250 RSP <ffffffff80762b78> CR2: 00000000000004d0 Signed-off-by: NVitaliy Gusev <vgusev@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 9月, 2008 1 次提交
-
-
由 Tom Quetchenbach 提交于
I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS for both sides of a bidirectional connection. man tcp says: "If this option is set before connection establishment, it also changes the MSS value announced to the other end in the initial packet." However, the kernel only uses the MTU/route cache to set the advertised MSS. That means if I set the MSS to, say, 500 before calling connect(), I will send at most 500-byte packets, but I will still receive 1500-byte packets in reply. This is a bug, either in the kernel or the documentation. This patch (applies to latest net-2.6) reduces the advertised value to that requested by the user as long as setsockopt() is called before connect() or accept(). This seems like the behavior that one would expect as well as that which is documented. I've tried to make sure that things that depend on the advertised MSS are set correctly. Signed-off-by: NTom Quetchenbach <virtualphtn@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 9月, 2008 1 次提交
-
-
由 Daniel Lezcano 提交于
How to reproduce ? - create a network namespace - use tcp protocol and get timewait socket - exit the network namespace - after a moment (when the timewait socket is destroyed), the kernel panics. # BUG: unable to handle kernel NULL pointer dereference at 0000000000000007 IP: [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8 PGD 119985067 PUD 11c5c0067 PMD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: ipv6 button battery ac loop dm_mod tg3 libphy ext3 jbd edd fan thermal processor thermal_sys sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table] Pid: 0, comm: swapper Not tainted 2.6.27-rc2 #3 RIP: 0010:[<ffffffff821e394d>] [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8 RSP: 0018:ffff88011ff7fed0 EFLAGS: 00010246 RAX: ffffffffffffffff RBX: ffffffff82339420 RCX: ffff88011ff7ff30 RDX: 0000000000000001 RSI: ffff88011a4d03c0 RDI: ffff88011ac2fc00 RBP: ffffffff823392e0 R08: 0000000000000000 R09: ffff88002802a200 R10: ffff8800a5c4b000 R11: ffffffff823e4080 R12: ffff88011ac2fc00 R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000 FS: 0000000041cbd940(0000) GS:ffff8800bff839c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000007 CR3: 00000000bd87c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff8800bff9e000, task ffff88011ff76690) Stack: ffffffff823392e0 0000000000000100 ffffffff821e3a3a 0000000000000008 0000000000000000 ffffffff821e3a61 ffff8800bff7c000 ffffffff8203c7e7 ffff88011ff7ff10 ffff88011ff7ff10 0000000000000021 ffffffff82351108 Call Trace: <IRQ> [<ffffffff821e3a3a>] ? inet_twdr_hangman+0x0/0x9e [<ffffffff821e3a61>] ? inet_twdr_hangman+0x27/0x9e [<ffffffff8203c7e7>] ? run_timer_softirq+0x12c/0x193 [<ffffffff820390d1>] ? __do_softirq+0x5e/0xcd [<ffffffff8200d08c>] ? call_softirq+0x1c/0x28 [<ffffffff8200e611>] ? do_softirq+0x2c/0x68 [<ffffffff8201a055>] ? smp_apic_timer_interrupt+0x8e/0xa9 [<ffffffff8200cad6>] ? apic_timer_interrupt+0x66/0x70 <EOI> [<ffffffff82011f4c>] ? default_idle+0x27/0x3b [<ffffffff8200abbd>] ? cpu_idle+0x5f/0x7d Code: e8 01 00 00 4c 89 e7 41 ff c5 e8 8d fd ff ff 49 8b 44 24 38 4c 89 e7 65 8b 14 25 24 00 00 00 89 d2 48 8b 80 e8 00 00 00 48 f7 d0 <48> 8b 04 d0 48 ff 40 58 e8 fc fc ff ff 48 89 df e8 c0 5f 04 00 RIP [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8 RSP <ffff88011ff7fed0> CR2: 0000000000000007 This patch provides a function to purge all timewait sockets related to a network namespace. The timewait sockets life cycle is not tied with the network namespace, that means the timewait sockets stay alive while the network namespace dies. The timewait sockets are for avoiding to receive a duplicate packet from the network, if the network namespace is freed, the network stack is removed, so no chance to receive any packets from the outside world. Furthermore, having a pending destruction timer on these sockets with a network namespace freed is not safe and will lead to an oops if the timer callback which try to access data belonging to the namespace like for example in: inet_twdr_do_twkill_work -> NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_TIMEWAITED); Purging the timewait sockets at the network namespace destruction will: 1) speed up memory freeing for the namespace 2) fix kernel panic on asynchronous timewait destruction Signed-off-by: NDaniel Lezcano <dlezcano@fr.ibm.com> Acked-by: NDenis V. Lunev <den@openvz.org> Acked-by: NEric W. Biederman <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-