- 09 10月, 2018 5 次提交
-
-
由 David Ahern 提交于
Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and mpls_netconf_dump_devconf for strict data checking. If the flag is set, the dump request is expected to have an netconfmsg struct as the header. The struct only has the family member and no attributes can be appended. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NChristian Brauner <christian@brauner.io> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Add helper to check netlink message for route dumps. If the strict flag is set the dump request is expected to have an rtmsg struct as the header. All elements of the struct are expected to be 0 with the exception of rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX set. Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute, and ip6mr_rtm_dumproute to call this helper if strict data checking is enabled. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Update ipmr_rtm_dumplink for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NChristian Brauner <christian@brauner.io> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Update inet_dump_ifaddr for strict data checking. If the flag is set, the dump request is expected to have an ifaddrmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID attribute is supported. Follow on patches can support for other fields (e.g., honor ifa_index and only return data for the given device index). Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NChristian Brauner <christian@brauner.io> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Make sure extack is passed to nlmsg_parse where easy to do so. Most of these are dump handlers and leveraging the extack in the netlink_callback. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NChristian Brauner <christian@brauner.io> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 10月, 2018 1 次提交
-
-
由 Willem de Bruijn 提交于
Avoid the socket lookup cost in udp_gro_receive if no socket has a udp tunnel callback configured. udp_sk(sk)->gro_receive requires a registration with setup_udp_tunnel_sock, which enables the static key. Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 10月, 2018 4 次提交
-
-
由 David Ahern 提交于
Move the refcounting and potential free of dst metrics associated for ipv4 and ipv6 to a common helper. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set. Move the common initialization code to a helper and use for both protocols. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Move the refcounting and potential free of dst metrics associated with a fib entry to a helper and use it in both ipv4 and ipv6. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Consolidate initialization of ipv4 and ipv6 metrics when fib entries are created into a single helper, ip_fib_metrics_init, that handles the call to ip_metrics_convert. If no metrics are defined for the fib entry, then the metrics is set to dst_default_metrics. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 10月, 2018 6 次提交
-
-
由 Eric Dumazet 提交于
Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy, do not do it. Fixes: 2efd4fca ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Robert Shearman 提交于
It is useful to be able to use the same socket for listening in a specific VRF, as for sending multicast packets out of a specific interface. However, the bound device on the socket currently takes precedence and results in the packets not being sent. Relax the condition on overriding the output interface to use for sending packets out of UDP, raw and ping sockets to allow multicast packets to be sent using the specified multicast interface. Signed-off-by: NRobert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: NMike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk)); in tcp_close() While a socket is being closed, it is very possible other threads find it in rtnetlink dump. tcp_get_info() will acquire the socket lock for a short amount of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);), enough to trigger the warning. Fixes: 67db3e4b ("tcp: no longer hold ehash lock while calling tcp_get_info()") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
Signed-off-by: NMaciej Żenczykowski <maze@google.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
(allows for better compiler optimization) Signed-off-by: NMaciej Żenczykowski <maze@google.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Timer handlers do not imply rcu_read_lock(), so my recent fix triggered a LOCKDEP warning when SYNACK is retransmit. Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt usages instead of guessing what is done by callers, since it is not worth the pain. Get rid of ireq_opt_deref() helper since it hides the logic without real benefit, since it is now a standard rcu_dereference(). Fixes: 1ad98e9d ("tcp/dccp: fix lockdep issue when SYN is backlogged") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 10月, 2018 5 次提交
-
-
由 Eric Dumazet 提交于
In the recent TCP/EDT patch series, I switched TCP and sch_fq clocks from MONOTONIC to TAI, in order to meet the choice done earlier for sch_etf packet scheduler. But sure enough, this broke some setups were the TAI clock jumps forward (by almost 50 year...), as reported by Leonard Crestez. If we want to converge later, we'll probably need to add an skb field to differentiate the clock bases, or a socket option. In the meantime, an UDP application will need to use CLOCK_MONOTONIC base for its SCM_TXTIME timestamps if using fq packet scheduler. Fixes: 72b0094f ("tcp: switch tcp_clock_ns() to CLOCK_TAI base") Fixes: 142537e4 ("net_sched: sch_fq: switch to CLOCK_TAI") Fixes: fd2bca2a ("tcp: switch internal pacing timer to CLOCK_TAI") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NLeonard Crestez <leonard.crestez@nxp.com> Tested-by: NLeonard Crestez <leonard.crestez@nxp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Soheil Hassas Yeganeh 提交于
When SKBs are coalesced, we can have SKBs with different frag sizes. Some with PAGE_SIZE and some not with PAGE_SIZE. Since recv_skip_hint is always set to the full SKB size, it can overestimate the amount that should be read using normal read for coalesced packets. Change the recv_skip_hint so that it only includes the first frags that are not of PAGE_SIZE. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Soheil Hassas Yeganeh 提交于
When we have less than PAGE_SIZE of data on receive queue, we set recv_skip_hint to 0. Instead, set it to the actual number of bytes available. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Previously receiver buffer auto-tuning starts after receiving one advertised window amount of data. After the initial receiver buffer was raised by patch a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), the reciver buffer may take too long to start raising. To address this issue, this patch lowers the initial bytes expected to receive roughly the expected sender's initial window. Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NWei Wang <weiwan@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
In normal SYN processing, packets are handled without listener lock and in RCU protected ingress path. But syzkaller is known to be able to trick us and SYN packets might be processed in process context, after being queued into socket backlog. In commit 06f877d6 ("tcp/dccp: fix other lockdep splats accessing ireq_opt") I made a very stupid fix, that happened to work mostly because of the regular path being RCU protected. Really the thing protecting ireq->ireq_opt is RCU read lock, and the pseudo request refcnt is not relevant. This patch extends what I did in commit 449809a6 ("tcp/dccp: block BH for SYN processing") by adding an extra rcu_read_{lock|unlock} pair in the paths that might be taken when processing SYN from socket backlog (thus possibly in process context) Fixes: 06f877d6 ("tcp/dccp: fix other lockdep splats accessing ireq_opt") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 9月, 2018 1 次提交
-
-
由 Yuchung Cheng 提交于
Previously TCP initial receive buffer is ~87KB by default and the initial receive window is ~29KB (20 MSS). This patch changes the two numbers to 128KB and ~64KB (rounding down to the multiples of MSS) respectively. The patch also simplifies the calculations s.t. the two numbers are directly controlled by sysctl tcp_rmem[1]: 1) Initial receiver buffer budget (sk_rcvbuf): while this should be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf() always override and set a larger size when a new connection establishes. 2) Initial receive window in SYN: previously it is set to 20 packets if MSS <= 1460. The number 20 was based on the initial congestion window of 10: the receiver needs twice amount to avoid being limited by the receive window upon out-of-order delivery in the first window burst. But since this only applies if the receiving MSS <= 1460, connection using large MTU (e.g. to utilize receiver zero-copy) may be limited by the receive window. With this patch TCP memory configuration is more straight-forward and more properly sized to modern high-speed networks by default. Several popular stacks have been announcing 64KB rwin in SYNs as well. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NWei Wang <weiwan@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 9月, 2018 3 次提交
-
-
由 Maciej Żenczykowski 提交于
(fix documentation and sysctl access to treat it as such) Tested: # zcat /proc/config.gz | egrep ^CONFIG_HZ CONFIG_HZ_1000=y CONFIG_HZ=1000 # echo $[(1<<32)/1000 + 1] | tee /proc/sys/net/ipv4/tcp_probe_interval 4294968 tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument # echo $[(1<<32)/1000] | tee /proc/sys/net/ipv4/tcp_probe_interval 4294967 # echo 0 | tee /proc/sys/net/ipv4/tcp_probe_interval # echo -1 | tee /proc/sys/net/ipv4/tcp_probe_interval -1 tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
(the parameters in question are mark and flow_flags) Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
(the parameters in question are mark and flow_flags) Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 9月, 2018 1 次提交
-
-
由 Paolo Abeni 提交于
Cong noted that we need the same checks introduced by commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the inner header") even for ipv4 tunnels. Fixes: c5441932 ("GRE: Refactor GRE tunneling code.") Suggested-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 9月, 2018 9 次提交
-
-
由 Peter Oskolkov 提交于
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are hard-limited to those of the root/init ns. There are at least two use cases when it would be desirable to set the high_thresh values higher in a child namespace vs the global hard limit: - a security/ddos protection policy may lower the thresholds in the root/init ns but allow for a special exception in a child namespace - testing: a test running in a namespace may want to set these thresholds higher in its namespace than what is in the root/init ns The new behavior: # ip netns add testns # ip netns exec testns bash # sysctl -w net.ipv4.ipfrag_high_thresh=9000000 net.ipv4.ipfrag_high_thresh = 9000000 # sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_high_thresh = 9000000 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000 net.ipv6.ip6frag_high_thresh = 9000000 # sysctl net.ipv6.ip6frag_high_thresh net.ipv6.ip6frag_high_thresh = 9000000 The old behavior: # ip netns add testns # ip netns exec testns bash # sysctl -w net.ipv4.ipfrag_high_thresh=9000000 net.ipv4.ipfrag_high_thresh = 9000000 # sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_high_thresh = 4194304 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000 net.ipv6.ip6frag_high_thresh = 9000000 # sysctl net.ipv6.ip6frag_high_thresh net.ipv6.ip6frag_high_thresh = 4194304 Signed-off-by: NPeter Oskolkov <posk@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev': net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable] cc1: all warnings being treated as errors Fixes: 78f2756c ("net/ipv4: Move device validation to helper") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Now TCP keeps track of tcp_wstamp_ns, recording the earliest departure time of next packet, we can remove duplicate code from tcp_internal_pacing() This removes one ktime_get_tai_ns() call, and a divide. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer has to do it. Thanks to this model, TCP can get more accurate RTT samples, since pacing no longer inflates them. This has the nice effect of removing some delays caused by FQ quantum mechanism, causing inflated max/P99 latencies. Also we might relax TCP Small Queue tight limits in the future, since this new model allow TCP to build bigger batches, since sch_fq (or a device with earliest departure time offload) ensure these packets will be delivered on time. Note that other protocols are not converted (they will probably never be) so sch_fq has still support for SO_MAX_PACING_RATE Tested: Test showing FQ pacing quantum artifact for low-rate flows, adding unexpected throttles for RPC flows, inflating max and P99 latencies. The parameters chosen here are to show what happens typically when a TCP flow has a reduced pacing rate (this can be caused by a reduced cwin after few losses, or/and rtt above few ms) MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY" Before : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 19,82.78,5279,3825,482.02 After : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 20,49.94,128,63,3.18 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Next patch will use tcp_wstamp_ns to feed internal TCP pacing timer, so switch to CLOCK_TAI to share same base. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec units to nsec units. Do not clear skb->tstamp before entering IP stacks in TX, so that qdisc or devices can implement pacing based on the earliest departure time instead of socket sk->sk_pacing_rate Packets are fed with tcp_wstamp_ns, and following patch will update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest departure time mechanism. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
TCP will soon provide earliest departure time on TX skbs. It needs to track this in a new variable. tcp_mstamp_refresh() needs to update this variable, and became too big to stay an inline. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
There are few places where TCP reads skb->skb_mstamp expecting a value in usec unit. skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value. Add tcp_skb_timestamp_us() to provide proper conversion when needed. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 zhong jiang 提交于
kfree_skb has taken the null pointer into account. hence it is safe to remove the redundant null pointer check before kfree_skb. Signed-off-by: Nzhong jiang <zhongjiang@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 9月, 2018 3 次提交
-
-
由 David Ahern 提交于
Convert nft_fib4_eval to the new device checking helper and remove the duplicate code. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Convert rpfilter_lookup_reverse to the new device checking helper and remove the duplicate code. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Move the device matching check in __fib_validate_source to a helper and export it for use by netfilter modules. Code move only; no functional change intended. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 9月, 2018 1 次提交
-
-
由 Stefan Nuernberger 提交于
commit 40413955 ("Cipso: cipso_v4_optptr enter infinite loop") fixed a possible infinite loop in the IP option parsing of CIPSO. The fix assumes that ip_options_compile filtered out all zero length options and that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist. While this assumption currently holds true, add explicit checks for zero length and invalid length options to be safe for the future. Even though ip_options_compile should have validated the options, the introduction of new one-byte options can still confuse this code without the additional checks. Signed-off-by: NStefan Nuernberger <snu@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Simon Veith <sveith@amazon.de> Cc: stable@vger.kernel.org Acked-by: NPaul Moore <paul@paul-moore.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 9月, 2018 1 次提交
-
-
由 Haishuang Yan 提交于
gre_parse_header stops parsing when csum_err is encountered, which means tpi->key is undefined and ip_tunnel_lookup will return NULL improperly. This patch introduce a NULL pointer as csum_err parameter. Even when csum_err is encountered, it won't return error and continue parsing gre header as expected. Fixes: 9f57c67c ("gre: Remove support for sharing GRE protocol hook.") Reported-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-