- 04 5月, 2012 1 次提交
-
-
由 Eric Dumazet 提交于
GRO is very optimistic in skb truesize estimates, only taking into account the used part of fragments. Be conservative, and use more precise computation, so that bloated GRO skbs can be collapsed eventually. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 5月, 2012 10 次提交
-
-
由 Alexander Duyck 提交于
This change cleans up the last bits of tcp_try_coalesce so that we only need one goto which jumps to the end of the function. The idea is to make the code more readable by putting things in a linear order so that we start execution at the top of the function, and end it at the bottom. I also made a slight tweak to the code for handling frags when we are a clone. Instead of making it an if (clone) loop else nr_frags = 0 I changed the logic so that if (!clone) we just set the number of frags to 0 which disables the for loop anyway. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This change reorders the code related to the use of an skb->head_frag so it is placed before we check the rest of the frags. This allows the code to read more linearly instead of like some sort of loop. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This patch addresses several issues in the way we were tracking the truesize in tcp_try_coalesce. First it was using ksize which prevents us from having a 0 sized head frag and getting a usable result. To resolve that this patch uses the end pointer which is set based off either ksize, or the frag_size supplied in build_skb. This allows us to compute the original truesize of the entire buffer and remove that value leaving us with just what was added as pages. The second issue was the use of skb->len if there is a mergeable head frag. We should only need to remove the size of an data aligned sk_buff from our current skb->truesize to compute the delta for a buffer with a reused head. By using skb->len the value of truesize was being artificially reduced which means that head frags could use more memory than buffers using standard allocations. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This change is meant ot prevent stealing the skb->head to use as a page in the event that the skb->head was cloned. This allows the other clones to track each other via shinfo->dataref. Without this we break down to two methods for tracking the reference count, one being dataref, the other being the page count. As a result it becomes difficult to track how many references there are to skb->head. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Before stealing fragments or skb head, we must make sure skbs are not cloned. Alexander was worried about destination skb being cloned : In bridge setups, a driver could be fooled if skb->data_len would not match skb nr_frags. If source skb is cloned, we must take references on pages instead. Bug happened using tcpdump (if not using mmap()) Introduce kfree_skb_partial() helper to cleanup code. Reported-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This a prepartion patch that refactors the code to enter recovery into a new function tcp_enter_recovery(). It's needed to implement the delayed fast retransmit in ER. Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 5月, 2012 18 次提交
-
-
由 Eric Dumazet 提交于
Add ECN (Explicit Congestion Notification) marking capability to netem tc qdisc add dev eth0 root netem drop 0.5 ecn Instead of dropping packets, try to ECN mark them. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: NHagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
TCP or UDP stacks have big enough latencies that prefetching next pointer is worth it. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Chapman 提交于
The netlink API lets users create unmanaged L2TPv3 tunnels using iproute2. Until now, a request to create an unmanaged L2TPv3 IP encapsulation tunnel over IPv6 would be rejected with EPROTONOSUPPORT. Now that l2tp_ip6 implements sockets for L2TP IP encapsulation over IPv6, we can add support for that tunnel type. Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Elston 提交于
L2TPv3 defines an IP encapsulation packet format where data is carried directly over IP (no UDP). The kernel already has support for L2TP IP encapsulation over IPv4 (l2tp_ip). This patch introduces support for L2TP IP encapsulation over IPv6. The implementation is derived from ipv6/raw and ipv4/l2tp_ip. Signed-off-by: NChris Elston <celston@katalix.com> Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Elston 提交于
For implementing other protocols on top of IPv6, such as L2TPv3's IP encapsulation over ipv6, we'd like to call some IPv6 functions which are not currently exported. This patch exports them. Signed-off-by: NChris Elston <celston@katalix.com> Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Elston 提交于
This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using the netlink API. We already support unmanaged L2TPv3 tunnels over IPv4. A patch to iproute2 to make use of this feature will be submitted separately. Signed-off-by: NChris Elston <celston@katalix.com> Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chris Elston 提交于
If an L2TP tunnel uses IPv6, make sure the l2tp debugfs file shows the IPv6 address correctly. Signed-off-by: NChris Elston <celston@katalix.com> Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Chapman 提交于
Userspace uses connect() to associate a pppol2tp socket with a tunnel socket. This needs to allow the caller to supply the new IPv6 sockaddr_pppol2tp structures if IPv6 is used. Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Chapman 提交于
The l2tp_ip socket currently maintains packet/byte stats in its private socket structure. But these counters aren't exposed to userspace and so serve no purpose. The counters were also smp-unsafe. So this patch just gets rid of the stats. While here, change a couple of internal __u32 variables to u32. Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Chapman 提交于
Cleanup the l2tp_ip code to make use of an existing ipv4 support function. Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Chapman 提交于
L2TP uses 64-bit counters but since these are not updated atomically, we need to make them safe for smp. This patch addresses that. Signed-off-by: NJames Chapman <jchapman@katalix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
__skb_splice_bits() can check if skb to be spliced has its skb->head mapped to a page fragment, instead of a kmalloc() area. If so we can avoid a copy of the skb head and get a reference on underlying page. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
TCP coalesce can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We had to disable coalescing in this case, for performance reasons. We 'upgrade' skb->head as a fragment in itself. This reduces number of cache misses when user makes its copies, since a less sk_buff are fetched. This makes receive and ofo queues shorter and thus reduce cache line misses in TCP stack. This is a followup of patch "net: allow skb->head to be a page fragment" Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce" counter being incremented. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
GRO can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We 'upgrade' skb->head as a fragment in itself This avoids the frag_list fallback, and permits to build true GRO skb (one sk_buff and up to 16 fragments), using less memory. This reduces number of cache misses when user makes its copy, since a single sk_buff is fetched. This is a followup of patch "net: allow skb->head to be a page fragment" Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
skb->head is currently allocated from kmalloc(). This is convenient but has the drawback the data cannot be converted to a page fragment if needed. We have three spots were it hurts : 1) GRO aggregation When a linear skb must be appended to another skb, GRO uses the frag_list fallback, very inefficient since we keep all struct sk_buff around. So drivers enabling GRO but delivering linear skbs to network stack aren't enabling full GRO power. 2) splice(socket -> pipe). We must copy the linear part to a page fragment. This kind of defeats splice() purpose (zero copy claim) 3) TCP coalescing. Recently introduced, this permits to group several contiguous segments into a single skb. This shortens queue lengths and save kernel memory, and greatly reduce probabilities of TCP collapses. This coalescing doesnt work on linear skbs (or we would need to copy data, this would be too slow) Given all these issues, the following patch introduces the possibility of having skb->head be a fragment in itself. We use a new skb flag, skb->head_frag to carry this information. build_skb() is changed to accept a frag_size argument. Drivers willing to provide a page fragment instead of kmalloc() data will set a non zero value, set to the fragment size. Then, on situations we need to convert the skb head to a frag in itself, we can check if skb->head_frag is set and avoid the copies or various fallbacks we have. This means drivers currently using frags could be updated to avoid the current skb->head allocation and reduce their memory footprint (aka skb truesize). (thats 512 or 1024 bytes saved per skb). This also makes bpf/netfilter faster since the 'first frag' will be part of skb linear part, no need to copy data. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paul Gortmaker 提交于
Some of the comment blocks are floating in limbo between two functions, or between blocks of code. Delete the extra line feeds between any comment and its associated following block of code, to be consistent with the majority of the rest of the kernel. Also delete trailing newlines at EOF and fix a couple trivial typos in existing comments. This is a 100% cosmetic change with no runtime impact. We get rid of over 500 lines of non-code, and being blank line deletes, they won't even show up as noise in git blame. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
由 Herbert Xu 提交于
Unfortunately it seems that I didn't properly test the case of an expired external querier in the recent multicast bridge series. The setup of the timer in that case is completely broken and leads to a NULL-pointer dereference. This patch fixes it. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Acked-by: NStephen Hemminger <shemminger@vyatta.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 4月, 2012 5 次提交
-
-
由 Benjamin LaHaise 提交于
Now that encap_rcv() works on IPv6 UDP sockets, wire L2TP up to IPv6. Support has been tested with and without hardware offloading. This version fixes the L2TP over localhost issue with incorrect checksums being reported. Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Benjamin LaHaise 提交于
Now that the sematics of udpv6_queue_rcv_skb() match IPv4's udp_queue_rcv_skb(), introduce the UDP encap_rcv() hook for IPv6. Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Benjamin LaHaise 提交于
In order to make sure that when the encap_rcv() hook is introduced it is not called with the socket lock held, move socket locking from callers into udpv6_queue_rcv_skb(), matching what happens in IPv4. Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Benjamin LaHaise 提交于
This is the first step in reworking the IPv6 UDP code to be structured more like the IPv4 UDP code. This patch creates __udpv6_queue_rcv_skb() with the equivalent sematics to __udp_queue_rcv_skb(), and wires it up to the backlog_rcv method. Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jeffrin Jose 提交于
Fixed a coding style issue relating to spaces in net/core/sock.c Signed-off-by: NJeffrin Jose <ahiliation@yahoo.co.in> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 4月, 2012 6 次提交
-
-
由 Allan Stephens 提交于
Adds check to ensure TIPC sockets reject incoming payload messages that have an unrecognized message type. Remove the old open question about whether TIPC_ERR_NO_PORT is the proper return value. It is appropriate here since there are valid instances where another node can make use of the reply, and at this point in time the host is already broadcasting TIPC data, so there are no real security concerns. Signed-off-by: NAllan Stephens <allan.stephens@windriver.com> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
由 Eric Dumazet 提交于
Use min_t()/max_t() macros, reformat two comments, use !!test_bit() to match !!sock_flag() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 hartleys 提交于
Include the header to pickup the definitions of the global symbols. Quiets the following sparse warnings: warning: symbol 'crush_find_rule' was not declared. Should it be static? warning: symbol 'crush_do_rule' was not declared. Should it be static? Signed-off-by: NH Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: NTore Anderson <tore@fud.no> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: NTore Anderson <tore@fud.no> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Allan Stephens 提交于
Consolidates validation of scope and name sequence range values into a single routine where it applies both to local name publications and to name publications issued by other nodes in the network. This change means that the scope value for non-local publications is now validated and the name sequence range for local publications is now validated only once. Additionally, a publication attempt that fails validation now creates an entry in the system log file only if debugging capabilities have been enabled; this prevents the system log from being cluttered up with messages caused by a defective application or network node. Signed-off-by: NAllan Stephens <allan.stephens@windriver.com> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
由 Allan Stephens 提交于
Replaces two identical chunks of code that delete an unused name sequence structure from TIPC's name table with calls to a new routine that performs this operation. This change is cosmetic and doesn't impact the operation of TIPC. Signed-off-by: NAllan Stephens <allan.stephens@windriver.com> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-