- 02 Jul, 2017 1 commit
-
-
Committed by Lawrence Brakmo
Add support for changing congestion control from SOCK_OPS bpf programs through the setsockopt bpf helper function. This also adds a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, which is needed for congestion controls, like dctcp, that need to enable ECN in the SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 01 Jul, 2017 1 commit
-
-
Committed by Reshetova, Elena
The refcount_t type and its corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This avoids accidental refcounter overflows that might lead to use-after-free situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
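To illustrate why a saturating counter avoids the overflow-to-use-after-free path described above, here is a minimal user-space sketch. It is not the kernel's refcount_t implementation: the `demo_refcount` type and helper names are hypothetical, and the code is deliberately not thread-safe. The assumed behavior is that a counter that reaches its maximum stays pinned there, so the object is leaked rather than freed while still referenced.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a saturating reference counter (not thread-safe). */
struct demo_refcount {
    unsigned int refs;
};

#define DEMO_REFCOUNT_SATURATED UINT_MAX

/* Take a reference; once saturated the counter never moves again. */
static void demo_refcount_inc(struct demo_refcount *r)
{
    if (r->refs == DEMO_REFCOUNT_SATURATED)
        return;
    r->refs++;
}

/* Drop a reference; returns true only when the caller must free the object. */
static bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
    if (r->refs == DEMO_REFCOUNT_SATURATED)
        return false;   /* pinned: leak instead of risking use-after-free */
    if (r->refs == 0)
        return false;   /* underflow attempt: refuse to wrap around */
    return --r->refs == 0;
}
```

With a plain wrapping counter, the increment past the maximum would wrap to 0 and a later decrement could free an object that is still in use; the saturating variant trades that for a bounded leak.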
-
- 28 Jun, 2017 1 commit
-
-
Committed by Dave Watson
If icsk_ulp_ops is unset, the getsockopt path dereferences a null pointer. Add a null pointer check.
  BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
  BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
  Read of size 4 at addr 0000000000000020 by task syz-executor1/15452
Signed-off-by: Dave Watson <davejwatson@fb.com>
Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 Jun, 2017 1 commit
-
-
Committed by WANG Cong
We have to reset sk->sk_rx_dst when we disconnect a TCP connection, because otherwise, when we re-connect it, this dst reference is simply overridden in tcp_finish_connect(). This fixes a dst leak which leads to a loopback dev refcnt leak. It is a long-standing bug; Kevin reported a very similar (if not the same) bug before. Thanks to Andrei for providing such a reliable reproducer, which greatly narrows down the problem.
Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Kevin Xu <kaiwen.xu@hulu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 Jun, 2017 1 commit
-
-
Committed by Ivan Delalande
Replace the first padding in the tcp_md5sig structure with a new flag field and an address prefix length, so that the prefix length can be specified when configuring a new key for TCP MD5 signature. The tcpm_flags field will only be used if the socket option is TCP_MD5SIG_EXT, to avoid breaking existing programs, and tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 16 Jun, 2017 2 commits
-
-
Committed by Dave Watson
Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to sendpages while the socket is already locked. tcp_sendpage is exported, but requires that the socket lock is not already held.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Dave Watson
Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP sockets. Based on a similar infrastructure in tcp_cong. The idea is that any ULP can add its own logic by changing the TCP proto_ops structure to its own methods.
Example usage:
  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
Modules will call:
  tcp_register_ulp(&tcp_tls_ulp_ops);
to register/unregister their ulp, with an init function and name.
A list of registered ulps will be returned by tcp_get_available_ulp, which is hooked up to /proc. Example:
  $ cat /proc/sys/net/ipv4/tcp_available_ulp
  tls
There is currently no functionality to remove or chain ULPs, but it should be possible to add these in the future if needed.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 08 Jun, 2017 1 commit
-
-
Committed by Eric Dumazet
DRAM supply shortage and poor memory pressure tracking in the TCP stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning limits) and tcp_mem[] quite hazardous.
The TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl limits being hit, but it only tracks the number of transitions.
If TCP stack behavior under stress was perfect:
  1) It would maintain memory usage close to the limit.
  2) Memory pressure state would be entered for short times.
We certainly prefer 100 events lasting 10ms compared to one event lasting 200 seconds.
This patch adds a new SNMP counter tracking the cumulative duration of memory pressure events, given in ms units.
  $ cat /proc/sys/net/ipv4/tcp_mem
  3088 4117 6176
  $ grep TCP /proc/net/sockstat
  TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
  $ nstat -n ; sleep 10 ; nstat |grep Pressure
  TcpExtTCPMemoryPressures 1700
  TcpExtTCPMemoryPressuresChrono 5209
v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David instructed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
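As a worked example of how the two counters can be read together, the sketch below divides the cumulative duration by the number of transitions to get an average time spent under pressure per event. The helper name is hypothetical, and it assumes both values are deltas taken over the same sampling interval, as in the nstat output above.

```c
/* Average duration of one memory pressure episode, in milliseconds.
 * chrono_ms: delta of TcpExtTCPMemoryPressuresChrono over the interval
 * events:    delta of TcpExtTCPMemoryPressures over the same interval
 * Returns 0.0 when no pressure event was recorded. */
static double demo_avg_pressure_ms(unsigned long chrono_ms, unsigned long events)
{
    if (events == 0)
        return 0.0;
    return (double)chrono_ms / (double)events;
}
```

For the sample above, demo_avg_pressure_ms(5209, 1700) comes out at a little over 3 ms per event, which is the "many short events" pattern the message describes as preferable.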
-
- 01 Jun, 2017 1 commit
-
-
MTU probing initialization occurred only at connect() and at SYN or SYN-ACK reception, but the former sets MSS to either the default or the user-set value (through the TCP_MAXSEG sockopt) and the latter never happens with repaired sockets. The result was that, with MTU probing enabled and unless the TCP_MAXSEG sockopt was used before connect(), probing would be stuck at the tcp_base_mss value until tcp_probe_interval seconds had passed.
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 May, 2017 1 commit
-
-
Committed by Wei Wang
The Fastopen API should be used to perform fastopen operations on the TCP socket. It does not make sense to use the fastopen API to perform a disconnect by calling it with AF_UNSPEC. The fastopen data path is also prone to race conditions and bugs when used with AF_UNSPEC.
One issue reported and analyzed by Vegard Nossum is as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Thread A:                          Thread B:
------------------------------------------------------------------------
sendto()
 - tcp_sendmsg()
    - sk_stream_memory_free() = 0
       - goto wait_for_sndbuf
          - sk_stream_wait_memory()
             - sk_wait_event() // sleep
          |                        sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
          |                         - tcp_sendmsg()
          |                            - tcp_sendmsg_fastopen()
          |                               - __inet_stream_connect()
          |                                  - tcp_disconnect() // because of AF_UNSPEC
          |                                     - tcp_transmit_skb() // send RST
          |                                  - return 0; // no reconnect!
          |                            - sk_stream_wait_connect()
          |                               - sock_error()
          |                                  - xchg(&sk->sk_err, 0)
          |                                  - return -ECONNRESET
          - ... // wake up, see sk->sk_err == 0
    - skb_entail() on TCP_CLOSE socket
If the connection is reopened then we will send a brand new SYN packet after thread A has already queued a buffer. At this point I think the socket internal state (sequence numbers etc.) becomes messed up.
When the new connection is closed, the FIN-ACK is rejected because the sequence number is outside the window. The other side tries to retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Hence, this patch adds a check for AF_UNSPEC in the fastopen data path and returns EOPNOTSUPP to the user if such a case happens.
Fixes: cf60af03 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 23 May, 2017 1 commit
-
-
Committed by Rohit Chavan
Fixed a coding style issue.
Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 May, 2017 1 commit
-
-
Committed by Wei Wang
When tcp_disconnect() is called, inet_csk_delack_init() sets icsk->icsk_ack.rcv_mss to 0. This could potentially cause the tcp_recvmsg() => tcp_cleanup_rbuf() => __tcp_select_window() call path to hit a division by 0. So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
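The hazard can be sketched in user space as follows. This is not the kernel code: the struct and function names are hypothetical, the window rounding is a simplification of what a window selection routine might do, and the value 88 is assumed here for TCP_MIN_MSS. The point is only that any divisor derived from rcv_mss must be non-zero after re-initialization.

```c
/* Assumed value of the kernel's TCP_MIN_MSS; treat it as an illustration. */
#define DEMO_TCP_MIN_MSS 88u

/* Hypothetical stand-in for the delayed-ACK state that holds rcv_mss. */
struct demo_ack_state {
    unsigned int rcv_mss;
};

/* Re-initialize the state on disconnect with a safe, non-zero MSS. */
static void demo_delack_init(struct demo_ack_state *ack)
{
    ack->rcv_mss = DEMO_TCP_MIN_MSS;
}

/* Round the free space down to a multiple of rcv_mss.
 * With rcv_mss == 0 this division would trap, which is the bug class fixed. */
static unsigned int demo_round_window(unsigned int free_space,
                                      const struct demo_ack_state *ack)
{
    return free_space / ack->rcv_mss * ack->rcv_mss;
}
```

Initializing to a small valid MSS rather than guarding every division keeps the fast path unchanged, which appears to be the design choice the patch makes.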
-
- 18 May, 2017 4 commits
-
-
Committed by Eric Dumazet
The TCP Timestamps option is defined in RFC 7323.
Traditionally on Linux, it has been tied to the internal 'jiffies' variable, because it had been a cheap and good enough generator.
For TCP flows on the Internet, 1 ms resolution would be much better than 4 ms or 10 ms (HZ=250 or HZ=100 respectively).
For TCP flows in the DC, Google has used usec resolution for more than two years with great success [1].
Receive size autotuning (DRS) is indeed more precise and converges faster to the optimal window size.
This patch converts tp->tcp_mstamp to a plain u64 value storing a 1 usec TCP clock.
This choice will allow us to upstream the 1 usec TS option as discussed in IETF 97.
[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet
After this patch, all uses of tcp_time_stamp will require a change when we introduce the 1 ms and/or 1 us TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet
tcp_time_stamp will no longer be tied to jiffies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet
Use tcp_jiffies32 instead of tcp_time_stamp to feed tp->lsndtime.
tcp_time_stamp will soon be a little bit more expensive than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 01 May, 2017 1 commit
-
-
Committed by Davide Caratti
Avoid direct access to sk->sk_state when tcp_poll() is called on a socket using active TCP fastopen with deferred connect. Use the local variable 'state', which stores the result of sk_state_load(), as was done in commit 00fd38d9 ("tcp: ensure proper barriers in lockless contexts").
Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 Apr, 2017 1 commit
-
-
Committed by Eric Dumazet
Some devices or distributions use HZ=100 or HZ=250.
TCP receive buffer autotuning has poor behavior caused by this choice. Since autotuning happens after 4 ms or 10 ms, short distance flows get their receive buffer tuned to a very high value, but after an initial period where it was frozen to a (too small) initial value.
With the introduction of tp->tcp_mstamp, we can switch to high resolution timestamps almost for free (at the expense of 8 additional bytes per TCP structure).
Note that some TCP stacks use usec TCP timestamps, where this patch makes even more sense: many TCP flows have < 500 usec RTT. Hopefully this finer TS option can be standardized soon.
Tested:
  HZ=100 kernel
  ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
Peer without patch:
  lpaa24:~# ss -tmi dst lpaa23
  ...
  skmem:(r0,rb8388608,...) rcv_rtt:10 rcv_space:3210000 minrtt:0.017
Peer with the patch:
  lpaa23:~# ss -tmi dst lpaa24
  ...
  skmem:(r0,rb428800,...) rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
We can see a saner RCVBUF, and more precise rcv_rtt information.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
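The resolution argument above can be made concrete with a small sketch: it counts how many whole clock ticks elapse between two instants for a clock running at a given frequency. The function name is hypothetical and this only illustrates quantization; it does not reproduce the kernel's timestamp code.

```c
#include <stdint.h>

/* Whole ticks elapsed between two instants (given in microseconds)
 * on a clock that advances hz times per second. */
static uint64_t demo_elapsed_ticks(uint64_t start_us, uint64_t end_us, uint64_t hz)
{
    return end_us * hz / 1000000u - start_us * hz / 1000000u;
}
```

A 500 usec round trip measured with demo_elapsed_ticks(0, 500, 100) or demo_elapsed_ticks(0, 500, 250) yields zero ticks, so a jiffies-based clock cannot see it at all, whereas a 1 usec clock (hz = 1000000) reports 500.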
-
- 25 Apr, 2017 1 commit
-
-
Committed by Wei Wang
Middlebox firewall issues can potentially cause the server's data to be blackholed after a successful 3WHS using TFO. Following are the related reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped.
  C ---> syn-data ---> S
  C <--- syn/ack ----- S
  C (accept & write)
  C <---- data ------- S
  C ----- ACK -> X     S
          [retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation where the server's data gets dropped after 3WHS.
  C ---- syn-data ---> S
  C <--- syn/ack ----- S
  C ---- ack --------> S
  S (accept & write)
  C?  X <- data ------ S
          [retry and timeout]
This is the worst failure because the client cannot detect such behavior to mitigate the situation (such as disabling TFO). Failing to proceed, the application (e.g., SSL library) may simply time out and retry with TFO again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the following circumstances:
  1. client side TFO socket detects out of order FIN
  2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO socket not opened on loopback has successfully received data segs from the server.
And we examine this condition during close().
The rationale behind it is that when such a firewall issue happens, the application running on the client should eventually close the socket as it is not able to get the data it is expecting. Or the application running on the server should close the socket as it is not able to receive any response from the client. In both cases, an out of order FIN or RST will be received on the client, given that the firewall will not block them as no data are in those frames.
And we want to disable active TFO globally as it helps if the middle box is very close to the client and most of the connections are likely to fail.
Also, add a debug sysctl:
  tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when the firewall blackhole issue happens. This can be set and read. Setting it to 0 disables the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
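The doubling schedule described above (1h, 2h, 4h, 8h, ...) can be sketched as a small helper. This is an illustration, not the kernel function: the name is hypothetical, and the cap on the shift (DEMO_MAX_SHIFT) is an assumption added here so the value cannot grow without bound; the message itself does not state a cap.

```c
/* Assumed upper bound on the number of doublings; not taken from the message. */
#define DEMO_MAX_SHIFT 6u

/* Disable period in seconds after `times` consecutive blackhole detections.
 * base_sec == 0 means the active disable logic is turned off. */
static unsigned long demo_tfo_disable_timeout(unsigned int base_sec,
                                              unsigned int times)
{
    unsigned int shift;

    if (base_sec == 0 || times == 0)
        return 0;
    shift = times - 1;
    if (shift > DEMO_MAX_SHIFT)
        shift = DEMO_MAX_SHIFT;
    return (unsigned long)base_sec << shift;
}
```

With a base of 3600 seconds, the first four detections give 3600, 7200, 14400 and 28800 seconds, matching the 1h, 2h, 4h, 8h sequence; a successful data exchange would reset `times` to zero.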
-
- 10 Apr, 2017 1 commit
-
-
Committed by Eric Dumazet
In the (very unlikely) case a passive socket becomes a listener, we do not want to duplicate its saved SYN headers.
This would lead to double frees, use after free, and please hackers and various fuzzers.
Tested:
  0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
  +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
  +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  +0 bind(3, ..., ...) = 0
  +0 listen(3, 5) = 0
  +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
  +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 257
  +0 accept(3, ..., ...) = 4
  +0 connect(4, AF_UNSPEC, ...) = 0
  +0 close(3) = 0
  +0 bind(4, ..., ...) = 0
  +0 listen(4, 5) = 0
  +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
  +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 257
Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 05 Apr, 2017 1 commit
-
-
Committed by Gao Feng
Define a new macro TCP_MAX_WSCALE instead of the literal number '14', and use U16_MAX instead of 65535 as the max value of the TCP window. There is another minor change: use rounddown(space, mss) instead of (space / mss) * mss.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
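A user-space sketch of the two cleanups named above follows. The DEMO_ macros and the function are hypothetical stand-ins: DEMO_ROUNDDOWN mirrors the usual "x minus x modulo y" form of a rounddown helper, and demo_pick_wscale shows one plausible way a window scale could be chosen with the named limits instead of the literals 14 and 65535. It is an assumption that the kernel loop has this exact shape.

```c
#include <stdint.h>

#define DEMO_TCP_MAX_WSCALE 14u     /* largest shift count allowed by RFC 7323 */
#define DEMO_U16_MAX        65535u  /* largest unscaled 16-bit TCP window */

/* Same result as (x / y) * y for unsigned operands, stated more directly. */
#define DEMO_ROUNDDOWN(x, y) ((x) - ((x) % (y)))

/* Smallest scale that makes `space` fit in 16 bits, capped at the maximum. */
static unsigned int demo_pick_wscale(uint32_t space)
{
    unsigned int wscale = 0;

    while (space > DEMO_U16_MAX && wscale < DEMO_TCP_MAX_WSCALE) {
        space >>= 1;
        wscale++;
    }
    return wscale;
}
```

For example, a 6 MB buffer (6291456 bytes) needs a scale of 7, and DEMO_ROUNDDOWN(10000u, 1460u) gives 8760, the same as (10000 / 1460) * 1460.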
-
- 23 Mar, 2017 1 commit
-
-
Committed by Gao Feng
When user_mss is zero, it means use the default value. But the current code does not permit the user to set TCP_MAXSEG to the default value: it returns -EINVAL when val is zero.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 17 Mar, 2017 1 commit
-
-
Committed by Eric Dumazet
Commit b369e7fd ("tcp: make TCP_INFO more consistent") moved lock_sock_fast() earlier in tcp_get_info().
This has the minor effect that the jiffies value sampled at the beginning of tcp_get_info() is more likely to be off by one, and we report big tcpi_last_data_sent values (like 0xFFFFFFFF).
Since we lock the socket, fetching tcp_time_stamp right before doing the jiffies_to_msecs() calls is enough to remove these wrong values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
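The large value mentioned above is what unsigned 32-bit subtraction produces when the "now" sample is one tick older than the last-send timestamp. A minimal sketch, with a hypothetical helper name and ignoring the conversion to milliseconds:

```c
#include <stdint.h>

/* Ticks since the last transmission, computed in unsigned 32-bit arithmetic.
 * If `now` was sampled before `last_sent` was updated, the result wraps. */
static uint32_t demo_idle_ticks(uint32_t now, uint32_t last_sent)
{
    return now - last_sent;
}
```

demo_idle_ticks(1000, 1001) evaluates to 0xFFFFFFFF, while sampling the clock after taking the lock (so that now >= last_sent) gives small, sensible values such as demo_idle_ticks(1005, 1001) == 4.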
-
- 03 Mar, 2017 1 commit
-
-
Committed by Wei Wang
tp->fastopen_req could potentially be double freed if a malicious user does the following:
  1. Enable the TCP_FASTOPEN_CONNECT sockopt and do a connect() on the socket.
  2. Call connect() with AF_UNSPEC to disconnect the socket.
  3. Make this socket a listening socket by calling listen().
  4. Accept incoming connections and generate child sockets. All child sockets will get a copy of the pointer to fastopen_req.
  5. Call close() on all sockets. fastopen_req will get freed multiple times.
Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 Feb, 2017 1 commit
-
-
Committed by Eric Dumazet
sk_page_frag_refill() allocates either a compound page or an order-0 page. We can use page_ref_inc(), which is slightly faster than get_page().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 07 Feb, 2017 1 commit
-
-
Committed by Eric Dumazet
Splicing from a TCP socket is vulnerable when a packet with the URG flag is received and stored in the receive queue.
__tcp_splice_read() returns 0, and sk_wait_data() immediately returns since the problematic skb is in the queue.
This is a nice way to burn cpu (aka infinite loop) and trigger soft lockups.
Again, this gem was found by the syzkaller tool.
Fixes: 9c55e01c ("[TCP]: Splice receive support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Jan, 2017 1 commit
-
-
Committed by Yuchung Cheng
Add two stats in SCM_TIMESTAMPING_OPT_STATS:
  TCP_NLA_DATA_SEGS_OUT: total data packets sent including retransmission
  TCP_NLA_TOTAL_RETRANS: total data packets retransmitted
The names are picked to be consistent with the corresponding fields in TCP_INFO. This allows applications that use the timestamping API to measure latency stats to also retrieve the retransmission rate of application writes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 Jan, 2017 2 commits
-
-
Committed by Willy Tarreau
Without TFO, any subsequent connect() call after a successful one returns -1 EISCONN. The last API update ensured that __inet_stream_connect() can return -1 EINPROGRESS in response to sendmsg() when TFO is in use, to indicate that the connection is now in progress. Unfortunately, since this function is used both for connect() and sendmsg(), it has the undesired side effect of making connect() now return -1 EINPROGRESS as well after a successful call, while at the same time poll() returns POLLOUT. This can confuse some applications which happen to call connect() and check for -1 EISCONN to ensure the connection is usable, and for which EINPROGRESS indicates a need to poll, causing a loop.
This problem was encountered in haproxy, where a call to connect() is precisely used in certain cases to confirm a connection's readiness. While arguably haproxy's behaviour should be improved here, it seems important to aim at a more robust behaviour when the goal of the new API is to make it easier to implement TFO in existing applications.
This patch simply ensures that we preserve the same semantics as in the non-TFO case on the connect() syscall when using TFO, while still returning -1 EINPROGRESS on sendmsg(). For this we simply tell __inet_stream_connect() whether we're doing a regular connect() or in fact connecting for a sendmsg() call.
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Wei Wang
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an alternative way to perform Fast Open on the active side (client). Prior to this patch, a client needs to replace the connect() call with sendto(MSG_FASTOPEN). This can be cumbersome for applications that want to use Fast Open: these socket operations are often done in lower layer libraries used by many other applications. Changing these libraries and/or the socket call sequences is not trivial. A more convenient approach is to perform Fast Open by simply enabling a socket option when the socket is created, without changing the sequence of other socket calls:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the kernel.
  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and return -1 with errno = EINPROGRESS.
  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno = EINPROGRESS.
    No MSG_FASTOPEN flag is needed.
  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is established but no msg is received yet.
    Return number of bytes read if socket is established and there is msg received.
The new API simplifies life for applications that always perform a write() immediately after a successful connect(). Such applications can now take advantage of Fast Open by merely making one new setsockopt() call at the time of creating the socket. Nothing else about the application's socket call sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 Jan, 2017 1 commit
-
-
Committed by Eric Dumazet
Shaohua Li made percpu_counter irq safe in commit 098faf58 ("percpu_counter: make APIs irq safe").
We can safely remove the BH disable/enable sections around various percpu_counter manipulations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 14 Jan, 2017 2 commits
-
-
Committed by Yuchung Cheng
Thin stream DUPACK starts fast recovery on only one DUPACK, provided the connection is a thin stream (i.e., low inflight). But this older feature is now subsumed by RACK. If a connection receives only a single DUPACK, RACK would arm a reordering timer and soon start fast recovery instead of timing out if no further ACKs are received.
The socket option (THIN_DUPACK) is kept as a nop for compatibility. Note that this patch does not change another thin-stream feature which enables linear RTO, although it might be good to generalize that in the future (i.e., linear RTO for the first, say, 3 retries).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuchung Cheng
This patch removes the support of RFC5827 early retransmit (i.e., fast recovery on small inflight with <3 dupacks) because it is subsumed by the new RACK loss detection. More specifically, when RACK receives DUPACKs, it will arm a reordering timer to start fast recovery after a quarter of (min)RTT; hence it covers early retransmit, except that RACK does not limit itself to specific inflight or dupack numbers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 Jan, 2017 1 commit
-
-
Committed by Eric Dumazet
tcp_get_info() has to lock the socket, so let's lock it for an extended critical section, so that various fields have consistent values.
This solves an annoying issue that some applications reported when multiple counters are updated during one particular rx/rx event, and TCP_INFO was called from another cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 06 Jan, 2017 1 commit
-
-
Committed by Soheil Hassas Yeganeh
For TCP sockets, TX timestamps are only captured when the user data is successfully and fully written to the socket. In many cases, however, TCP writes can be partial, for which no timestamp is collected.
Collect timestamps whenever any user data is (fully or partially) copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp instead of the local skb pointer, since it can be set to NULL on the error path.
Note that tcp_write_queue_tail can be NULL, even if bytes have been copied to the socket. This is because acknowledgements are being processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is called tcp_write_queue_tail can be NULL. For such cases, this patch does not collect any timestamps (i.e., it is best-effort).
This patch is written with suggestions from Willem de Bruijn and Eric Dumazet.
Change-log V1 -> V2:
  - Use sockc.tsflags instead of sk->sk_tsflags.
  - Use the same code path for normal writes and errors.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Dec, 2016 2 commits
-
-
Committed by Haishuang Yan
Applications in different namespaces might require a different maximal number of remembered connection requests.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Haishuang Yan
Applications in different namespaces might require fast recycling of TIME-WAIT sockets independently of the host.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 25 Dec, 2016 1 commit
-
-
Committed by Linus Torvalds
This was entirely automated, using the script by Al:
  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
    $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 06 Dec, 2016 1 commit
-
-
Committed by Eric Dumazet
Having tsq_flags in the same cache line as sk_wmem_alloc makes a lot of sense. Both fields are changed from tcp_wfree() and more generally by various TSQ related functions.
The prior patch made room in struct sock and added sk_tsq_flags; this patch deletes tsq_flags from struct tcp_sock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Nov, 2016 2 commits
-
-
Committed by Francis Yan
This patch exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently we can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allows further breaking down the various sender limitations. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing a small receive window. It is not possible to tell before this patch without packet traces.
To prepare these stats, the user needs to set the SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags while requesting other SOF_TIMESTAMPING TX timestamps. When the timestamps are available in the error queue, the stats are returned in a separate control message of type SCM_TIMESTAMPING_OPT_STATS, in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME, TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. The unit is microseconds.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
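To show what walking such a list of TLVs might look like on the receiving side, here is a self-contained sketch. It does not use the kernel headers: `demo_nlattr` is a hypothetical mirror of a 4-byte length/type header, the attribute type numbers in the usage are placeholders rather than the real TCP_NLA_* values, payloads are assumed to be host-endian 64-bit integers, and attributes are assumed to be padded to 4-byte boundaries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of a netlink-style attribute header. */
struct demo_nlattr {
    uint16_t len;   /* header plus payload, excluding padding */
    uint16_t type;
};

#define DEMO_HDRLEN   ((size_t)sizeof(struct demo_nlattr))
#define DEMO_ALIGN(n) (((size_t)(n) + 3u) & ~(size_t)3u)

/* Append one u64 attribute at offset `off`; returns the new offset.
 * The caller must provide a large enough buffer. */
static size_t demo_put_u64(unsigned char *buf, size_t off,
                           uint16_t type, uint64_t val)
{
    struct demo_nlattr hdr = { (uint16_t)(DEMO_HDRLEN + sizeof(val)), type };

    memcpy(buf + off, &hdr, sizeof(hdr));
    memcpy(buf + off + DEMO_HDRLEN, &val, sizeof(val));
    return off + DEMO_ALIGN(hdr.len);
}

/* Find the first u64 attribute of `type`; returns 1 on success, 0 otherwise. */
static int demo_find_u64(const unsigned char *buf, size_t buflen,
                         uint16_t type, uint64_t *out)
{
    size_t off = 0;

    while (off <= buflen && buflen - off >= DEMO_HDRLEN) {
        struct demo_nlattr hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.len < DEMO_HDRLEN || hdr.len > buflen - off)
            return 0;   /* malformed or truncated attribute */
        if (hdr.type == type && hdr.len == DEMO_HDRLEN + sizeof(*out)) {
            memcpy(out, buf + off + DEMO_HDRLEN, sizeof(*out));
            return 1;
        }
        off += DEMO_ALIGN(hdr.len);
    }
    return 0;
}
```

In a real receiver the buffer would be the payload of the SCM_TIMESTAMPING_OPT_STATS control message read from the socket error queue; here demo_put_u64 only exists to build sample input for demo_find_u64.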
-
Committed by Francis Yan
This patch exports all the sender chronograph measurements collected in the previous patches to the TCP_INFO interface. Note that the busy time exported includes all the other sending limits (rwnd-limited, sndbuf-limited). Internally the time unit is jiffy, but externally the measurements are in microseconds for future extensions.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-