- 21 3月, 2015 2 次提交
-
-
由 Eric Dumazet 提交于
One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When request sock are put in ehash table, the whole notion of having a previous request to update dl_next is pointless. Also, following patch will get rid of big purge timer, so we want to delete a request sock without holding listener lock. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 3月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
While testing last patch series, I found req sock refcounting was wrong. We must set skc_refcnt to 1 for all request socks added in hashes, but also on request sockets created by FastOpen or syncookies. It is tricky because we need to defer this initialization so that future RCU lookups do not try to take a refcount on a not yet fully initialized request socket. Also get rid of ireq_refcnt alias. Signed-off-by: NEric Dumazet <edumazet@google.com> Fixes: 13854e5a ("inet: add proper refcounting to request sock") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 3月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
reqsk_put() is the generic function that should be used to release a refcount (and automatically call reqsk_free()) reqsk_free() might be called if refcount is known to be 0 or undefined. refcnt is set to one in inet_csk_reqsk_queue_add() As request socks are not yet in global ehash table, I added temporary debugging checks in reqsk_put() and reqsk_free() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 3月, 2015 1 次提交
-
-
由 Fan Du 提交于
As per RFC4821 7.3. Selecting Probe Size, a probe timer should be armed once probing has converged. Once this timer expired, probing again to take advantage of any path PMTU change. The recommended probing interval is 10 minutes per RFC1981. Probing interval could be sysctled by sysctl_tcp_probe_interval. Eric Dumazet suggested to implement pseudo timer based on 32bits jiffies tcp_time_stamp instead of using classic timer for such rare event. Signed-off-by: NFan Du <fan.du@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 1月, 2015 1 次提交
-
-
由 Daniel Borkmann 提交于
This patch adds necessary infrastructure to the congestion control framework for later per route congestion control support. For a per route congestion control possibility, our aim is to store a unique u32 key identifier into dst metrics, which can then be mapped into a tcp_congestion_ops struct. We argue that having a RTAX key entry is the most simple, generic and easy way to manage, and also keeps the memory footprint of dst entries lower on 64 bit than with storing a pointer directly, for example. Having a unique key id also allows for decoupling actual TCP congestion control module management from the FIB layer, i.e. we don't have to care about expensive module refcounting inside the FIB at this point. We first thought of using an IDR store for the realization, which takes over dynamic assignment of unused key space and also performs the key to pointer mapping in RCU. While doing so, we stumbled upon the issue that due to the nature of dynamic key distribution, it just so happens, arguably in very rare occasions, that excessive module loads and unloads can lead to a possible reuse of previously used key space. Thus, previously stale keys in the dst metric are now being reassigned to a different congestion control algorithm, which might lead to unexpected behaviour. One way to resolve this would have been to walk FIBs on the actually rare occasion of a module unload and reset the metric keys for each FIB in each netns, but that's just very costly. Therefore, we argue a better solution is to reuse the unique congestion control algorithm name member and map that into u32 key space through jhash. For that, we split the flags attribute (as it currently uses 2 bits only anyway) into two u32 attributes, flags and key, so that we can keep the cacheline boundary of 2 cachelines on x86_64 and cache the precalculated key at registration time for the fast path. On average we might expect 2 - 4 modules being loaded worst case perhaps 15, so a key collision possibility is extremely low, and guaranteed collision-free on LE/BE for all in-tree modules. Overall this results in much simpler code, and all without the overhead of an IDR. Due to the deterministic nature, modules can now be unloaded, the congestion control algorithm for a specific but unloaded key will fall back to the default one, and on module reload time it will switch back to the expected algorithm transparently. Joint work with Florian Westphal. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 9月, 2014 1 次提交
-
-
由 Eric Dumazet 提交于
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default, or more if some sysctl (eg tcp_retries2) are changed. Better use 64bit to perform icsk_rto << icsk_backoff operations As Joe Perches suggested, add a helper for this. Yuchung spotted the tcp_v4_err() case. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 8月, 2014 1 次提交
-
-
由 Neal Cardwell 提交于
Make sure we use the correct address-family-specific function for handling MTU reductions from within tcp_release_cb(). Previously AF_INET6 sockets were incorrectly always using the IPv6 code path when sometimes they were handling IPv4 traffic and thus had an IPv4 dst. Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Diagnosed-by: NWillem de Bruijn <willemb@google.com> Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications") Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 4月, 2014 1 次提交
-
-
由 Eric Dumazet 提交于
ip_queue_xmit() assumes the skb it has to transmit is attached to an inet socket. Commit 31c70d59 ("l2tp: keep original skb ownership") changed l2tp to not change skb ownership and thus broke this assumption. One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(), so that we do not assume skb->sk points to the socket used by l2tp tunnel. Fixes: 31c70d59 ("l2tp: keep original skb ownership") Reported-by: NZhan Jianyu <nasa4836@gmail.com> Tested-by: NZhan Jianyu <nasa4836@gmail.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 9月, 2013 1 次提交
-
-
由 Joe Perches 提交于
There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 3月, 2013 1 次提交
-
-
由 Nandita Dukkipati 提交于
This patch series implement the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occuring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2*SRTT, 10ms) and is adjusted to account for delayed ACK timer when there is only one oustanding packet. TLP Algorithm On transmission of new data in Open state: -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms). -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms) -> PTO = min(PTO, RTO) Conditions for scheduling PTO: -> Connection is in Open state. -> Connection is either cwnd limited or no new data to send. -> Number of probes per tail loss episode is limited to one. -> Connection is SACK enabled. When PTO fires: new_segment_exists: -> transmit new segment. -> packets_out++. cwnd remains same. no_new_packet: -> retransmit the last segment. Its ACK triggers FACK or early retransmit based recovery. ACK path: -> rearm RTO at start of ACK processing. -> reschedule PTO if need be. In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, tcp_early_retrans sysctl. tcp_early_retrans==0; disables TLP and ER. ==1; enables RFC5827 ER. ==2; delayed ER. ==3; TLP and delayed ER. [DEFAULT] ==4; TLP only. The TLP patch series have been extensively tested on Google Web servers. It is most effective for short Web trasactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: NNandita Dukkipati <nanditad@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 12月, 2012 1 次提交
-
-
由 Christoph Paasch 提交于
If in either of the above functions inet_csk_route_child_sock() or __inet_inherit_port() fails, the newsk will not be freed: unreferenced object 0xffff88022e8a92c0 (size 1592): comm "softirq", pid 0, jiffies 4294946244 (age 726.160s) hex dump (first 32 bytes): 0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................ 02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5 [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481 [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416 [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701 [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4 [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233 [<ffffffff814cee68>] ip_rcv+0x217/0x267 [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553 [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82 This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus a single sock_put() is not enough to free the memory. Additionally, things like xfrm, memcg, cookie_values,... may have been initialized. We have to free them properly. This is fixed by forcing a call to tcp_done(), ending up in inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary, because it ends up doing all the cleanup on xfrm, memcg, cookie_values, xfrm,... Before calling tcp_done, we have to set the socket to SOCK_DEAD, to force it entering inet_csk_destroy_sock. To avoid the warning in inet_csk_destroy_sock, inet_num has to be set to 0. As inet_csk_destroy_sock does a dec on orphan_count, we first have to increase it. Calling tcp_done() allows us to remove the calls to tcp_clear_xmit_timer() and tcp_cleanup_congestion_control(). A similar approach is taken for dccp by calling dccp_done(). This is in the kernel since 093d2823 (tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()), thus since version >= 2.6.37. Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 8月, 2012 1 次提交
-
-
由 Eric Dumazet 提交于
IPv6 needs a cookie in dst_check() call. We need to add rx_dst_cookie and provide a family independent sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 7月, 2012 1 次提交
-
-
由 David S. Miller 提交于
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 7月, 2012 1 次提交
-
-
由 David S. Miller 提交于
This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need it's own local version of this helper. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 7月, 2012 1 次提交
-
-
由 David S. Miller 提交于
No longer used. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 6月, 2012 1 次提交
-
-
由 Eric Dumazet 提交于
Don't cache output dst for syncookies, as this adds pressure on IP route cache and rcu subsystem for no gain. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 6月, 2012 1 次提交
-
-
由 David S. Miller 提交于
The get_peer method TCP uses is full of special cases that make no sense accommodating, and it also gets in the way of doing more reasonable things here. First of all, if the socket doesn't have a usable cached route, there is no sense in trying to optimize timewait recycling. Likewise for the case where we have IP options, such as SRR enabled, that make the IP header destination address (and thus the destination address of the route key) differ from that of the connection's destination address. Just return a NULL peer in these cases, and thus we're also able to get rid of the clumsy inetpeer release logic. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 4月, 2012 1 次提交
-
-
由 Eric Dumazet 提交于
Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: NTore Anderson <tore@fud.no> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: NTore Anderson <tore@fud.no> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 4月, 2012 1 次提交
-
-
由 Alex Copot 提交于
We must try harder to get unique (addr, port) pairs when doing port autoselection for sockets with SO_REUSEADDR option set. We achieve this by adding a relaxation parameter to inet_csk_bind_conflict. When 'relax' parameter is off we return a conflict whenever the current searched pair (addr, port) is not unique. This tries to address the problems reported in patch: 8d238b25 Revert "tcp: bind() fix when many ports are bound" Tests where ran for creating and binding(0) many sockets on 100 IPs. The results are, on average: * 60000 sockets, 600 ports / IP: * 0.210 s, 620 (IP, port) duplicates without patch * 0.219 s, no duplicates with patch * 100000 sockets, 1000 ports / IP: * 0.371 s, 1720 duplicates without patch * 0.373 s, no duplicates with patch * 200000 sockets, 2000 ports / IP: * 0.766 s, 6900 duplicates without patch * 0.768 s, no duplicates with patch * 500000 sockets, 5000 ports / IP: * 2.227 s, 41500 duplicates without patch * 2.284 s, no duplicates with patch Signed-off-by: NAlex Copot <alex.mihai.c@gmail.com> Signed-off-by: NDaniel Baluta <dbaluta@ixiacom.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 11月, 2011 1 次提交
-
-
由 Eric Dumazet 提交于
Make clear that sk_clone() and inet_csk_clone() return a locked socket. Add _lock() prefix and kerneldoc. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 5月, 2011 1 次提交
-
-
由 David S. Miller 提交于
This way the caller can get at the fully resolved fl4->{daddr,saddr} etc. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 5月, 2011 2 次提交
-
-
由 David S. Miller 提交于
This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
This is just like inet_csk_route_req() except that it operates after we've created the new child socket. In this way we can use the new socket's cork flow for proper route key storage. This will be used by DCCP and TCP child socket creation handling. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 12月, 2010 1 次提交
-
-
由 Shan Wei 提交于
These macros never be used, so remove them. Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 12月, 2010 1 次提交
-
-
由 David S. Miller 提交于
Then we can make a completely generic tcp_remember_stamp() that uses ->get_peer() as a helper, minimizing the AF specific code and minimizing the eventual code duplication when we implement the ipv6 side of TW recycling. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 8月, 2010 1 次提交
-
-
由 Jerry Chu 提交于
This patch provides a "user timeout" support as described in RFC793. The socket option is also needed for the the local half of RFC5482 "TCP User Timeout Option". TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int, when > 0, to specify the maximum amount of time in ms that transmitted data may remain unacknowledged before TCP will forcefully close the corresponding connection and return ETIMEDOUT to the application. If 0 is given, TCP will continue to use the system default. Increasing the user timeouts allows a TCP connection to survive extended periods without end-to-end connectivity. Decreasing the user timeouts allows applications to "fail fast" if so desired. Otherwise it may take upto 20 minutes with the current system defaults in a normal WAN environment. The socket option can be made during any state of a TCP connection, but is only effective during the synchronized states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK). Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will overtake keepalive to determine when to close a connection due to keepalive failure. The option does not change in anyway when TCP retransmits a packet, nor when a keepalive probe will be sent. This option, like many others, will be inherited by an acceptor from its listener. Signed-off-by: NH.K. Jerry Chu <hkchu@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 4月, 2010 1 次提交
-
-
由 Shan Wei 提交于
As Herbert Xu said: we should be able to simply replace ipfragok with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function) has droped ipfragok and set local_df value properly. The patch kills the ipfragok parameter of .queue_xmit(). Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 4月, 2010 1 次提交
-
-
由 Herbert Xu 提交于
inet: Remove unused send_check length argument This patch removes the unused length argument from the send_check function in struct inet_connection_sock_af_ops. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Tested-by: NYinghai <yinghai.lu@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 10月, 2009 1 次提交
-
-
由 David S. Miller 提交于
This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the the place, in each and every implementation. Based upon work done by Arjan van de Ven and feedback from Linus Torvalds. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 8月, 2008 1 次提交
-
-
由 Alexey Dobriyan 提交于
All users of struct proto::compat_[gs]etsockopt and struct inet_connection_sock_af_ops::compat_[gs]etsockopt are under #ifdef already, so use it in structure definition too. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2008 1 次提交
-
-
由 Denis V. Lunev 提交于
This call is nothing common with INET connection sockets code. It simply creates an unhashes kernel sockets for protocol messages. Move the new call into af_inet.c after the rename. Signed-off-by: NDenis V. Lunev <den@openvz.org> Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 2月, 2008 1 次提交
-
-
由 Arnaldo Carvalho de Melo 提交于
This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 1月, 2007 1 次提交
-
-
由 David S. Miller 提交于
Revert 93173112 We can't elide the skb_set_owner_w() here because things like certain netfilter targets (such as owner MATCH) need a socket to be set on the SKB for correct operation. Thanks to Jan Engelhardt and other netfilter list members for pointing this out. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 12月, 2006 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 03 12月, 2006 2 次提交
-
-
由 Arnaldo Carvalho de Melo 提交于
We have a hole in: [acme@newtoy net-2.6.20]$ pahole net/ipv6/tcp_ipv6.o inet_connection_sock_af_ops /* /pub/scm/linux/kernel/git/acme/net-2.6.20/include/net/inet_connection_sock.h:38 */ struct inet_connection_sock_af_ops { int (*queue_xmit)(); /* 0 4 */ void (*send_check)(); /* 4 4 */ int (*rebuild_header)(); /* 8 4 */ int (*conn_request)(); /* 12 4 */ struct sock * (*syn_recv_sock)(); /* 16 4 */ int (*remember_stamp)(); /* 20 4 */ __u16 net_header_len; /* 24 2 */ /* XXX 2 bytes hole, try to pack */ int (*setsockopt)(); /* 28 4 */ int (*getsockopt)(); /* 32 4 */ int (*compat_setsockopt)(); /* 36 4 */ int (*compat_getsockopt)(); /* 40 4 */ void (*addr2sockaddr)(); /* 44 4 */ int sockaddr_len; /* 48 4 */ }; /* size: 52, sum members: 50, holes: 1, sum holes: 2 */ But we don't need sockaddr_len to be an int: [acme@newtoy net-2.6.20]$ find net -name "*.[ch]" | xargs grep '\.sockaddr_len.\+=' | sort -u net/dccp/ipv4.c: .sockaddr_len = sizeof(struct sockaddr_in), net/dccp/ipv6.c: .sockaddr_len = sizeof(struct sockaddr_in6), net/ipv4/tcp_ipv4.c: .sockaddr_len = sizeof(struct sockaddr_in), net/ipv6/tcp_ipv6.c: .sockaddr_len = sizeof(struct sockaddr_in6), net/sctp/ipv6.c: .sockaddr_len = sizeof(struct sockaddr_in6), net/sctp/protocol.c: .sockaddr_len = sizeof(struct sockaddr_in), [acme@newtoy net-2.6.20]$ pahole --sizes net/ipv6/tcp_ipv6.o | grep sockaddr_in struct sockaddr_in: 16 0 struct sockaddr_in6: 28 0 [acme@newtoy net-2.6.20]$ So I turned sockaddr_len a 'u16', and now: [acme@newtoy net-2.6.20]$ pahole net/ipv6/tcp_ipv6.o inet_connection_sock_af_ops /* /pub/scm/linux/kernel/git/acme/net-2.6.20/include/net/inet_connection_sock.h:38 */ struct inet_connection_sock_af_ops { int (*queue_xmit)(); /* 0 4 */ void (*send_check)(); /* 4 4 */ int (*rebuild_header)(); /* 8 4 */ int (*conn_request)(); /* 12 4 */ struct sock * (*syn_recv_sock)(); /* 16 4 */ int (*remember_stamp)(); /* 20 4 */ u16 net_header_len; /* 24 2 */ u16 sockaddr_len; /* 26 2 */ int (*setsockopt)(); /* 28 4 */ int (*getsockopt)(); /* 32 4 */ int (*compat_setsockopt)(); /* 36 4 */ int (*compat_getsockopt)(); /* 40 4 */ void (*addr2sockaddr)(); /* 44 4 */ }; /* size: 48 */ So we've saved 4 bytes: [acme@newtoy net-2.6.20]$ codiff -sV /tmp/tcp_ipv6.o.before net/ipv6/tcp_ipv6.o /pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv6/tcp_ipv6.c: struct inet_connection_sock_af_ops | -4 net_header_len; from: __u16 /* 24(0) 2(0) */ to: u16 /* 24(0) 2(0) */ sockaddr_len; from: int /* 48(0) 4(0) */ to: u16 /* 26(0) 2(0) */ 1 struct changed [acme@newtoy net-2.6.20]$ Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
-
由 David S. Miller 提交于
The data itself is already charged to the SKB, doing the skb_set_owner_w() just generates a lot of noise and extra atomics we don't really need. Lmbench improvements on lat_tcp are minimal: before: TCP latency using localhost: 23.2701 microseconds TCP latency using localhost: 23.1994 microseconds TCP latency using localhost: 23.2257 microseconds after: TCP latency using localhost: 22.8380 microseconds TCP latency using localhost: 22.9465 microseconds TCP latency using localhost: 22.8462 microseconds Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 9月, 2006 2 次提交
-
-
由 Al Viro 提交于
rport argument is net-endian Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Al Viro 提交于
raddr is net-endian Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 9月, 2006 1 次提交
-
-
由 Alexey Kuznetsov 提交于
It does not affect either mss-sized connections (obviously) or connections controlled by Nagle (because there is only one small segment in flight). The idea is to record the fact that a small segment arrives on a connection, where one small segment has already been received and still not-ACKed. In this case ACK is forced after tcp_recvmsg() drains receive buffer. In other words, it is a "soft" each-2nd-segment ACK, which is enough to preserve ACK clock even when ABC is enabled. Signed-off-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-