- 22 11月, 2016 3 次提交
-
-
由 Florian Westphal 提交于
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno halving behaviour. It seems more appropriate to let congctl algorithms pair .ssthresh and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it up for all congestion algorithms that used to rely on the fallback. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
congestion control algorithms that do not halve cwnd in their .ssthresh should provide a .cwnd_undo rather than rely on current fallback which assumes reno halving (and thus doubles the cwnd). All of these do 'something else' in their .ssthresh implementation, thus store the cwnd on loss and provide .undo_cwnd to restore it again. A followup patch will remove the fallback and all algorithms will need to provide a .cwnd_undo function. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
UDP_SKB_CB(skb)->partial_cov is located at offset 66 in skb, requesting a cold cache line being read in cpu cache. We can avoid this cache line miss for UDP sockets, as partial_cov has a meaning only for UDPLite. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2016 1 次提交
-
-
由 Alexey Dobriyan 提交于
1) cast to "int" is unnecessary: u8 will be promoted to int before decrementing, small positive numbers fit into "int", so their values won't be changed during promotion. Once everything is int including loop counters, signedness doesn't matter: 32-bit operations will stay 32-bit operations. But! Someone tried to make this loop smart by making everything of the same type apparently in an attempt to optimise it. Do the optimization, just differently. Do the cast where it matters. :^) 2) frag size is unsigned entity and sum of fragments sizes is also unsigned. Make everything unsigned, leave no MOVSX instruction behind. add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4) function old new delta skb_cow_data 835 834 -1 ip_do_fragment 2549 2548 -1 ip6_fragment 3130 3128 -2 Total: Before=154865032, After=154865028, chg -0.00% Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 11月, 2016 2 次提交
-
-
由 Alexey Dobriyan 提交于
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
UDP busy polling is restricted to connected UDP sockets. This is because sk_busy_loop() only takes care of one NAPI context. There are cases where it could be extended. 1) Some hosts receive traffic on a single NIC, with one RX queue. 2) Some applications use SO_REUSEPORT and associated BPF filter to split the incoming traffic on one UDP socket per RX queue/thread/cpu 3) Some UDP sockets are used to send/receive traffic for one flow, but they do not bother with connect() This patch records the napi_id of first received skb, giving more reach to busy polling. Tested: lpaa23:~# echo 70 >/proc/sys/net/core/busy_read lpaa24:~# echo 70 >/proc/sys/net/core/busy_read lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done Before patch : 27867 28870 37324 41060 41215 36764 36838 44455 41282 43843 After patch : 73920 73213 70147 74845 71697 68315 68028 75219 70082 73707 Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 11月, 2016 3 次提交
-
-
由 Andrey Vagin 提交于
The repair mode is used to get and restore sequence numbers and data from queues. It used to checkpoint/restore connections. Currently the repair mode can be enabled for sockets in the established and closed states, but for other states we have to dump the same socket properties, so lets allow to enable repair mode for these sockets. The repair mode reveals nothing more for sockets in other states. Signed-off-by: NAndrei Vagin <avagin@openvz.org> Acked-by: NPavel Emelyanov <xemul@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
draft-ietf-tcpm-dctcp-02 says: ... when the sender receives an indication of congestion (ECE), the sender SHOULD update cwnd as follows: cwnd = cwnd * (1 - DCTCP.Alpha / 2) So, lets do this and reduce cwnd more smoothly (and faster), as per current congestion estimate. Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
The commit 850cbadd ("udp: use it's own memory accounting schema") assumes that the socket proto has memory accounting enabled, but this is not the case for UDPLITE. Fix it enabling memory accounting for UDPLITE and performing fwd allocated memory reclaiming on socket shutdown. UDP and UDPLITE share now the same memory accounting limits. Also drop the backlog receive operation, since is no more needed. Fixes: 850cbadd ("udp: use it's own memory accounting schema") Reported-by: NAndrei Vagin <avagin@gmail.com> Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 11月, 2016 3 次提交
-
-
由 Julia Lawall 提交于
Since commit 7926dbfa ("netfilter: don't use mutex_lock_interruptible()"), the function xt_find_table_lock can only return NULL on an error. Simplify the call sites and update the comment before the function. The semantic patch that change the code is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression t,e; @@ t = \(xt_find_table_lock(...)\| try_then_request_module(xt_find_table_lock(...),...)\) ... when != t=e - ! IS_ERR_OR_NULL(t) + t @@ expression t,e; @@ t = \(xt_find_table_lock(...)\| try_then_request_module(xt_find_table_lock(...),...)\) ... when != t=e - IS_ERR_OR_NULL(t) + !t @@ expression t,e,e1; @@ t = \(xt_find_table_lock(...)\| try_then_request_module(xt_find_table_lock(...),...)\) ... when != t=e ?- t ? PTR_ERR(t) : e1 + e1 ... when any // </smpl> Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Eric Dumazet 提交于
With syzkaller help, Marco Grassi found a bug in TCP stack, crashing in tcp_collapse() Root cause is that sk_filter() can truncate the incoming skb, but TCP stack was not really expecting this to happen. It probably was expecting a simple DROP or ACCEPT behavior. We first need to make sure no part of TCP header could be removed. Then we need to adjust TCP_SKB_CB(skb)->end_seq Many thanks to syzkaller team and Marco for giving us a reproducer. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NMarco Grassi <marco.gra@gmail.com> Reported-by: NVladis Dronov <vdronov@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stephen Suryaputra Lin 提交于
In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0 and then the state of the neigh for the new_gw is checked. If the state isn't valid then the redirected route is deleted. This behavior is maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway is assigned to peer->redirect_learned.a4 before calling ipv4_neigh_lookup(). After commit 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again."), ipv4_neigh_lookup() is performed without the rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw) isn't zero, the function uses it as the key. The neigh is most likely valid since the old_gw is the one that sends the ICMP redirect message. Then the new_gw is assigned to fib_nh_exception. The problem is: the new_gw ARP may never gets resolved and the traffic is blackholed. So, use the new_gw for neigh lookup. Changes from v1: - use __ipv4_neigh_lookup instead (per Eric Dumazet). Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.") Signed-off-by: NStephen Suryaputra Lin <ssurya@ieee.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 11月, 2016 1 次提交
-
-
由 Lance Richardson 提交于
This is a follow-up to commit 9ee6c5dc ("ipv4: allow local fragmentation in ip_finish_output_gso()"), updating the comment documenting cases in which fragmentation is needed for egress GSO packets. Suggested-by: NShmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: NShmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: NLance Richardson <lrichard@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 11月, 2016 7 次提交
-
-
由 Eric Dumazet 提交于
After commit 6ed46d12 ("sock_diag: align nlattr properly when needed"), tcp_get_info() gets 64bit aligned memory, so we can avoid the unaligned helpers. Suggested-by: NDavid Miller <davem@davemloft.net> Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Lorenzo noted an Android unit test failed due to e0d56fdd: "The expectation in the test was that the RST replying to a SYN sent to a closed port should be generated with oif=0. In other words it should not prefer the interface where the SYN came in on, but instead should follow whatever the routing table says it should do." Revert the change to ip_send_unicast_reply and tcp_v6_send_response such that the oif in the flow is set to the skb_iif only if skb_iif is an L3 master. Fixes: e0d56fdd ("net: l3mdev: remove redundant calls") Reported-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Tested-by: NLorenzo Colitti <lorenzo@google.com> Acked-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
icmp_send is called in response to some event. The skb may not have the device set (skb->dev is NULL), but it is expected to have an rt. Update icmp_route_lookup to use the rt on the skb to determine L3 domain. Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX") Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Arnd Bergmann 提交于
Since commit ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU") the udp6_lib_lookup and udp4_lib_lookup functions are only provided when it is actually possible to call them. However, moving the callers now caused a link error: net/built-in.o: In function `nf_sk_lookup_slow_v6': (.text+0x131a39): undefined reference to `udp6_lib_lookup' net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4': nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup' This extends the #ifdef so we also provide the functions when CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively are set. Fixes: 8db4c5be ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c") Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Davide Caratti 提交于
modify registration and deregistration of layer-4 protocol trackers to facilitate inclusion of new elements into the current list of builtin protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP, UDPlite) layer-4 protocol trackers usually register/deregister themselves using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...). This sequence is interrupted and rolled back in case of error; in order to simplify addition of builtin protocols, the input of the above functions has been modified to allow registering/unregistering multiple protocols. Signed-off-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Eric Dumazet 提交于
We had various problems in the past in tcp_get_info() and used specific synchronization to avoid deadlocks. We would like to add more instrumentation points for TCP, and avoiding grabing socket lock in tcp_getinfo() was too costly. Being able to lock the socket allows to provide consistent set of fields. inet_diag_dump_icsk() can make sure ehash locks are not held any more when tcp_get_info() is called. We can remove syncp added in commit d654976c ("tcp: fix a potential deadlock in tcp_get_info()"), but we need to use lock_sock_fast() instead of spin_lock_bh() since TCP input path can now be run from process context. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Being lockless in tcp_get_info() is hard, because we need to add specific synchronization in TCP fast path, like seqcount. Following patch will change inet_diag_dump_icsk() to no longer hold any lock for non listeners, so that we can properly acquire socket lock in get_tcp_info() and let it return more consistent counters. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 11月, 2016 4 次提交
-
-
由 Alexander Duyck 提交于
The display of /proc/net/route has had a couple issues due to the fact that when I originally rewrote most of fib_trie I made it so that the iterator was tracking the next value to use instead of the current. In addition it had an off by 1 error where I was tracking the first piece of data as position 0, even though in reality that belonged to the SEQ_START_TOKEN. This patch updates the code so the iterator tracks the last reported position and key instead of the next expected position and key. In addition it shifts things so that all of the leaves start at 1 instead of trying to report leaves starting with offset 0 as being valid. With these two issues addressed this should resolve any off by one errors that were present in the display of /proc/net/route. Fixes: 25b97c01 ("ipv4: off-by-one in continuation handling in /proc/net/route") Cc: Andy Whitcroft <apw@canonical.com> Reported-by: NJason Baron <jbaron@akamai.com> Tested-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: NEric Dumazet <edumazet@google.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
So that we can use it even after orphaining the skbuff. Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Binding a raw socket to a local address fails if the socket is bound to an L3 domain: $ vrf-test -s -l 10.100.1.2 -R -I red error binding socket: 99: Cannot assign requested address Update raw_bind to look consider if sk_bound_dev_if is bound to an L3 domain and use inet_addr_type_table to lookup the address. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 11月, 2016 2 次提交
-
-
由 Lorenzo Colitti 提交于
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Lorenzo Colitti 提交于
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 11月, 2016 7 次提交
-
-
由 Eric Dumazet 提交于
After my commit, tcp_sendmsg() might restart its loop after processing socket backlog. If sk_err is set, we blindly return an error, even though we copied data to user space before. We should instead return number of bytes that could be copied, otherwise user space might resend data and corrupt the stream. This might happen if another thread is using recvmsg(MSG_ERRQUEUE) to process timestamps. Issue was diagnosed by Soheil and Willem, big kudos to them ! Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Tested-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Lance Richardson 提交于
Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack") Reported-by: NJan Tluka <jtluka@redhat.com> Signed-off-by: NLance Richardson <lrichard@redhat.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
The IP stack records the largest fragment of a reassembled packet in IPCB(skb)->frag_max_size. When reading a datagram or raw packet that arrived fragmented, expose the value to allow applications to estimate receive path MTU. Tested: Sent data over a veth pair of which the source has a small mtu. Sent data using netcat, received using a dedicated process. Verified that the cmsg IP_RECVFRAGSIZE is returned only when data arrives fragmented, and in that cases matches the veth mtu. ip link add veth0 type veth peer name veth1 ip netns add from ip netns add to ip link set dev veth1 netns to ip netns exec to ip addr add dev veth1 192.168.10.1/24 ip netns exec to ip link set dev veth1 up ip link set dev veth0 netns from ip netns exec from ip addr add dev veth0 192.168.10.2/24 ip netns exec from ip link set dev veth0 up ip netns exec from ip link set dev veth0 mtu 1300 ip netns exec from ethtool -K veth0 ufo off dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 & ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Imagine initial value of max_skb_frags is 17, and last skb in write queue has 15 frags. Then max_skb_frags is lowered to 14 or smaller value. tcp_sendmsg() will then be allowed to add additional page frags and eventually go past MAX_SKB_FRAGS, overflowing struct skb_shared_info. Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cyrill Gorcunov 提交于
I managed to miss that sk_for_each is called under "for" cycle so need to use goto here to return matching socket. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cyrill Gorcunov 提交于
In raw_diag_destroy the helper raw_sock_get returns with sock_hold call, so we have to put it then. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
Andrey reported this kernel warning: WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724 __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719 do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210 kernel/sched/wait.c:178 Modules linked in: CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000 ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237 ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550 [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565 [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719 [< inline >] slab_pre_alloc_hook mm/slab.h:393 [< inline >] slab_alloc_node mm/slub.c:2634 [< inline >] slab_alloc mm/slub.c:2716 [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240 [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113 [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374 [< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141 [< inline >] dccp_feat_change_recv net/dccp/feat.c:1141 [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411 [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128 [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644 [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681 [< inline >] sk_backlog_rcv ./include/net/sock.h:872 [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044 [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502 [< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547 [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617 [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656 [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533 [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514 [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:209 Unlike commit 26cabd31 ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the sleeping function is called before schedule_timeout(), this is indeed a bug. Fix this by moving the wait logic to the new API, it is similar to commit ff960a73 ("netdev, sched/wait: Fix sleeping inside wait event"). Reported-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 11月, 2016 4 次提交
-
-
由 Pablo Neira Ayuso 提交于
Don't copy relevant fields from hook state structure, instead use the one that is already available in struct xt_action_param. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Pablo Neira Ayuso 提交于
Place pointer to hook state in xt_action_param structure instead of copying the fields that we need. After this change xt_action_param fits into one cacheline. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Cyrill Gorcunov 提交于
While being preparing patches for killing raw sockets via diag netlink interface I noticed that my runs are stuck: | [root@pcs7 ~]# cat /proc/`pidof ss`/stack | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33 | [<ffffffff8179b517>] raw_abort+0x33/0x42 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52 which has not been the case before. I narrowed it down to the commit | commit 286c72de | Author: Eric Dumazet <edumazet@google.com> | Date: Thu Oct 20 09:39:40 2016 -0700 | | udp: must lock the socket in udp_disconnect() where we start locking the socket for different reason. So the raw_abort escaped the renaming and we have to fix this typo using __udp_disconnect instead. Fixes: 286c72de ("udp: must lock the socket in udp_disconnect()") CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
As Ilya Lesokhin suggested, we can collapse two skbs at retransmit time even if the skb at the right has fragments. We simply have to use more generic skb_copy_bits() instead of skb_copy_from_linear_data() in tcp_collapse_retrans() Also need to guard this skb_copy_bits() in case there is nothing to copy, otherwise skb_put() could panic if left skb has frags. Tested: Used following packetdrill test // Establish a connection. 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.100 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 write(4, ..., 200) = 200 +0 > P. 1:201(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 201:401(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 401:601(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 601:801(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 801:1001(200) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1001:1101(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1101:1201(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1201:1301(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1301:1401(100) ack 1 +.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401> // Check that TCP collapse works : +0 > P. 1:1001(1000) ack 1 Reported-by: NIlya Lesokhin <ilyal@mellanox.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 11月, 2016 2 次提交
-
-
由 Pablo Neira Ayuso 提交于
We need this split to reuse existing codebase for the upcoming nf_tables socket expression. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Florian Westphal 提交于
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter just dispatches to ipv4 or ipv6 one based on nfproto). Currently supports fetching output interface index/name and the rtm_type associated with an address. This can be used for adding path filtering. rtm_type is useful to e.g. enforce a strong-end host model where packets are only accepted if daddr is configured on the interface the packet arrived on. The fib expression is a native nftables alternative to the xtables addrtype and rp_filter matches. FIB result order for oif/oifname retrieval is as follows: - if packet is local (skb has rtable, RTF_LOCAL set, this will also catch looped-back multicast packets), set oif to the loopback interface. - if fib lookup returns an error, or result points to local, store zero result. This means '--local' option of -m rpfilter is not supported. It is possible to use 'fib type local' or add explicit saddr/daddr matching rules to create exceptions if this is really needed. - store result in the destination register. In case of multiple routes, search set for desired oif in case strict matching is requested. ipv4 and ipv6 behave fib expressions are supposed to behave the same. [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings") http://patchwork.ozlabs.org/patch/688615/ to address fallout from this patch after rebasing nf-next, that was posted to address compilation warnings. --pablo ] Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 01 11月, 2016 1 次提交
-
-
由 David Ahern 提交于
Enable support for IPv4 multicast: - similar to unicast the flow struct is updated to L3 master device if relevant prior to calling fib_rules_lookup. The table id is saved to the lookup arg so the rule action for ipmr can return the table associated with the device. - ip_mr_forward needs to check for master device mismatch as well since the skb->dev is set to it - allow multicast address on VRF device for Rx by checking for the daddr in the VRF device as well as the original ingress device - on Tx need to drop to __mkroute_output when FIB lookup fails for multicast destination address. - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates IPMR FIB rules on first device create similar to FIB rules. In addition the VRF driver does not divert IPv4 multicast packets: it breaks on Tx since the fib lookup fails on the mcast address. With this patch, ipmr forwarding and local rx/tx work. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-