- 29 January 2016 (2 commits)
-
-
Submitted by Neal Cardwell

This commit fixes a corner case in tcp_mark_head_lost() which was causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.

tcp_mark_head_lost() was assuming that if a packet has tcp_skb_pcount(skb) of N, then it is safe to fragment off a prefix of M*mss bytes, for any M < N. But with the tricky way TCP pcounts are maintained, this is not always true.

For example, suppose the sender sends 4 1-byte packets and the last 3 packets are SACKed. It will merge the last 3 packets in the write queue into an skb with pcount = 3 and len = 3 bytes. If another recovery happens after a SACK reneging event, tcp_mark_head_lost() may attempt to split the skb assuming it has more than 2*MSS bytes.

This sounds very counterintuitive, but as the description for the related commit c0638c24 ("tcp: don't fragment SACKed skbs in tcp_mark_head_lost()") notes, this is because tcp_shifted_skb() coalesces adjacent regions of SACKed skbs, and when doing this it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire.

The c0638c24 commit tried to avoid problems by not fragmenting SACKed skbs, since SACKed skbs are where the non-proportionality between pcount and skb->len/mss is known to be possible. However, that commit did not handle the case where, during a reneging event, one of these weird SACKed skbs becomes an un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.

The fix is to simply mark the entire skb lost when this happens. This makes the recovery slightly more aggressive in such corner cases before we detect reordering; but once we detect reordering, this code path is bypassed because FACK is disabled.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
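A minimal sketch of the guard this describes, from inside the write-queue walk in tcp_mark_head_lost() (identifiers follow net/ipv4/tcp_input.c, but treat the exact shape as approximate rather than the literal upstream diff):

    unsigned int mss = tcp_skb_mss(skb);
    unsigned int lost = (packets - oldcnt) * mss;

    /* A coalesced-then-reneged skb can have pcount > len/mss, so the
     * computed prefix may exceed what the skb really holds; only
     * fragment when the split point lies inside the skb. */
    if (lost < skb->len &&
        tcp_fragment(sk, skb, lost, mss, GFP_ATOMIC) < 0)
        break;
    /* otherwise fall through and mark the entire skb lost */
    tcp_skb_mark_lost(tp, skb);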
-
Submitted by Joe Stringer

Later parts of the stack (including fragmentation) expect that there is never a socket attached to a frag in a frag_list; however, this invariant was not enforced on all defrag paths. This could lead to the BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the end of this commit message.

While the call could be added to openvswitch to fix this particular error, the head and tail of the frags list are already orphaned indirectly inside ip_defrag(), so it seems like the remaining fragments should all be orphaned in all circumstances.

kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:
<IRQ>
[<ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
[<ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch]
[<ffffffff81667830>] ? dst_discard_out+0x20/0x20
[<ffffffff81667810>] ? dst_ifdown+0x80/0x80
[<ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
[<ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210
[<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
[<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch]
[<ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch]
[<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch]
[<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch]
[<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[<ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
[<ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch]
[<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[<ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch]
[<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[<ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0
[<ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0
[<ffffffff81660f20>] __dev_queue_xmit+0x800/0x930
[<ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930
[<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
[<ffffffff81669876>] ? neigh_resolve_output+0x106/0x220
[<ffffffff81661060>] dev_queue_xmit+0x10/0x20
[<ffffffff816698e8>] neigh_resolve_output+0x178/0x220
[<ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590
[<ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590
[<ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590
[<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
[<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
[<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
[<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
[<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
[<ffffffff816ab4c0>] ip_output+0x70/0x110
[<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
[<ffffffff816aa9f9>] ip_local_out+0x39/0x70
[<ffffffff816abf89>] ip_send_skb+0x19/0x40
[<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
[<ffffffff816df21a>] icmp_push_reply+0xea/0x120
[<ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230
[<ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50
[<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<ffffffff810d5f9e>] ? rcu_read_lock_held+0x5e/0x70
[<ffffffff816dfa06>] icmp_echo+0x36/0x70
[<ffffffff816e0d11>] icmp_rcv+0x271/0x450
[<ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0
[<ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0
[<ffffffff816a5160>] ip_local_deliver+0x60/0xd0
[<ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560
[<ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560
[<ffffffff816a5453>] ip_rcv+0x283/0x3e0
[<ffffffff810b6302>] ? match_held_lock+0x192/0x200
[<ffffffff816a4620>] ? inet_del_offload+0x40/0x40
[<ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0
[<ffffffff8165e68e>] ? process_backlog+0x8e/0x230
[<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
[<ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60
[<ffffffff8165e678>] process_backlog+0x78/0x230
[<ffffffff8165e6dd>] ? process_backlog+0xdd/0x230
[<ffffffff8165e355>] net_rx_action+0x155/0x400
[<ffffffff8106b48c>] __do_softirq+0xcc/0x420
[<ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590
[<ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30
<EOI>
[<ffffffff8106b88e>] do_softirq+0x4e/0x60
[<ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0
[<ffffffff816a8eb0>] ip_finish_output2+0x240/0x590
[<ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0
[<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
[<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
[<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
[<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
[<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
[<ffffffff816ab4c0>] ip_output+0x70/0x110
[<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
[<ffffffff816aa9f9>] ip_local_out+0x39/0x70
[<ffffffff816abf89>] ip_send_skb+0x19/0x40
[<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
[<ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30
[<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0
[<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0
[<ffffffff816e7495>] ? inet_sendmsg+0x5/0x1d0
[<ffffffff8163e398>] sock_sendmsg+0x38/0x50
[<ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270
[<ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320
[<ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460
[<ffffffff81204886>] ? __fget_light+0x66/0x90
[<ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80
[<ffffffff8163f932>] SyS_sendmsg+0x12/0x20
[<ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP [<ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0
RSP <ffff88006d603170>

Fixes: 7f8a436e ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
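A hedged sketch of the remedy described, assuming the orphan is done centrally on entry to ip_defrag() so every queued fragment loses its socket reference (the exact placement in net/ipv4/ip_fragment.c may differ):

    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
    {
        /* ... */
        /* frags queued on a frag_list must never carry a socket;
         * drop the reference before the skb enters the frag queue */
        skb_orphan(skb);
        /* ... */
    }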
-
- 26 January 2016 (1 commit)
-
-
Submitted by Thomas Egerer
The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have to select CRYPTO_ECHAINIV in order to work properly. This solves the issues caused by a misconfiguration as described in [1]. The original approach, patching crypto/Kconfig, was turned down by Herbert Xu [2].

[1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html
[2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2

Signed-off-by: Thomas Egerer <hakke_007@gmx.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
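A sketch of the kind of Kconfig change this implies, based on the INET_ESP entry in net/ipv4/Kconfig (the selects shown besides CRYPTO_ECHAINIV are the pre-existing ones, reproduced from memory, so treat the list as approximate):

    config INET_ESP
    	tristate "IP: ESP transformation"
    	select XFRM_ALGO
    	select CRYPTO
    	select CRYPTO_AUTHENC
    	select CRYPTO_HMAC
    	select CRYPTO_MD5
    	select CRYPTO_CBC
    	select CRYPTO_SHA1
    	select CRYPTO_DES
    	select CRYPTO_ECHAINIV
    	---help---
    	  Support for IPsec ESP.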
-
- 22 January 2016 (1 commit)
-
-
Submitted by Eric Dumazet

Neal reported crashes with this stack trace:

RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
[<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
[<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
[<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
[<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
[<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
[<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
[<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
[<ffffffff8c17623e>] sock_aio_read+0xee/0x110
[<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
[<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
[<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b

The problem here is that the skb we provide to tcp_v4_send_ack() had to be parked in the backlog of a new TCP fastopen child because this child was owned by the user at the time an out-of-window packet arrived. Before queuing a packet, TCP has to set skb->dev to NULL, as the device could disappear before the packet is removed from the queue.

Fix this issue by using the net pointer provided by the socket (being a timewait or a request socket). IPv6 is immune to the bug: tcp_v6_send_response() already gets the net pointer from the socket if provided.

Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
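A hedged sketch of the shape of the fix in tcp_v4_send_ack() (signature approximate; the point is where the net pointer now comes from):

    static void tcp_v4_send_ack(const struct sock *sk, struct sk_buff *skb /* , ... */)
    {
        /* sk is the timewait or request socket that triggered the ACK;
         * its netns is always valid, unlike skb->dev, which is NULLed
         * when the skb is parked in a fastopen child's backlog */
        struct net *net = sock_net(sk);
        /* ... */
    }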
-
- 21 January 2016 (1 commit)
-
-
Submitted by Eric Dumazet
Lorenzo reported that we could not properly find v4mapped sockets in inet_diag_find_one_icsk(). This patch fixes the issue.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 January 2016 (1 commit)
-
-
Submitted by Eric Dumazet

Using a combination of connected and unconnected sockets, Dmitry was able to trigger soft lockups with his fuzzer.

The problem is that sockets in the SO_REUSEPORT array might have different scores. Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into the soreuseport array with a score which is smaller than the score of the first socket sk1 found in the hash table (I am speaking of the regular UDP hash table), if sk1 had the connect() done, giving a +8 to its score.

hash bucket [X] -> sk1 -> sk2 -> NULL
sk1 score = 14 (because it did a connect())
sk2 score = 6

SO_REUSEPORT fast selection is an optimization. If it turns out the score of the selected socket does not match the score of the first socket, just fall back to the old SO_REUSEPORT logic instead of trying to be too smart. Normal SO_REUSEPORT users do not mix different kinds of sockets, as this mechanism is used to load-balance traffic.

Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Craig Gallek <kraigatgoog@gmail.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
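A schematic sketch of the fallback described, inside the UDP lookup loop (reuseport_select_sock() and compute_score() are the names used in net/ipv4/udp.c of that era; the control flow here is illustrative, not the literal diff):

    /* trust the reuseport fast path only when the picked socket
     * scores the same as the best-scoring socket seen so far */
    struct sock *sk2 = reuseport_select_sock(sk, hash, skb,
                                             sizeof(struct udphdr));

    if (sk2 && compute_score(sk2, /* same args as for sk */ ...) == badness)
        return sk2;     /* fast selection is consistent */
    /* otherwise fall back to the old full-scan SO_REUSEPORT logic */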
-
- 16 January 2016 (1 commit)
-
-
Submitted by Konstantin Khlebnikov

skb_gso_segment() uses the skb control block during segmentation. This patch adds a 32-byte room for the previous control block, which will be copied into all resulting segments.

This patch fixes a kernel crash when fragmenting forwarded packets: fragmentation requires a valid IP CB in the skb for clearing ip options. The patch also removes the custom save/restore in ovs code, which is now redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
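A hedged sketch of the layout change this implies for the GSO scratch area in include/linux/skbuff.h (macro spelling as best I recall from the upstream fix; verify before relying on it):

    /* skb->cb[] is 48 bytes; reserve the first 32 for the protocol CB
     * (e.g. struct inet_skb_parm) so it survives into every segment,
     * and park the GSO bookkeeping after that prefix */
    #define SKB_SGO_CB_OFFSET	32
    #define SKB_GSO_CB(skb)	((struct skb_gso_cb *)((skb)->cb + SKB_SGO_CB_OFFSET))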
-
- 15 January 2016 (8 commits)
-
-
Submitted by Johannes Weiner
According to <linux/jump_label.h> the direct use of struct static_key is deprecated. Update the socket and slab accounting code accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
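A hedged sketch of what this migration looks like in general (the key name here is illustrative, not necessarily the one touched by this patch):

    /* old, deprecated style */
    struct static_key memcg_sockets_key = STATIC_KEY_INIT_FALSE;

    if (static_key_false(&memcg_sockets_key))
        /* ... accounting path ... */;

    /* new wrapper API from <linux/jump_label.h> */
    DEFINE_STATIC_KEY_FALSE(memcg_sockets_key);

    if (static_branch_unlikely(&memcg_sockets_key))
        /* ... accounting path ... */;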
-
Submitted by Johannes Weiner
The unified hierarchy memory controller is going to use this jump label as well to control the networking callbacks. Move it to the memory controller code and give it a more generic name.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Johannes Weiner
There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Johannes Weiner

There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls (hidden behind a jump label) to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole.

Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchronously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
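A hedged sketch of the charge/uncharge pair this describes, as a caller in net/core/sock.c might use it (mem_cgroup_charge_skmem()/mem_cgroup_uncharge_skmem() are the names from this series; the surrounding logic is schematic):

    /* on allocation: charge the socket's memcg; a false return means
     * the cgroup is over its limit and the allocation should shrink */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
        !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
        goto suppress_allocation;

    /* on free: give the pages back to the cgroup */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg)
        mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);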
-
Submitted by Johannes Weiner
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup, but it only ever sets these entries to the value of the memory_allocated page_counter limit. Use the latter directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Johannes Weiner
The number of allocated sockets is used for calculations in the soft limit phase, where packets are accepted but the socket is under memory pressure. Since there is no soft limit phase in tcp_memcontrol, and memory pressure is only entered when packets are already dropped, this is actually dead code. Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Johannes Weiner
Move the jump-label from sock_update_memcg() and sock_release_memcg() to the callsite, and so eliminate those function calls when socket accounting is not enabled. This also eliminates the need for dummy functions, because the calls will be optimized away if the Kconfig options are not enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Vladimir Davydov

There are two bits defined for cg_proto->flags (MEMCG_SOCK_ACTIVATED and MEMCG_SOCK_ACTIVE); both are set in tcp_update_limit, but the former is never cleared, while the latter can be cleared by unsetting the limit. This makes it possible to disable tcp socket accounting for new sockets after it was enabled, by writing -1 to memory.kmem.tcp.limit_in_bytes, while still guaranteeing that the memcg_socket_limit_enabled static key will be decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use case would be. By enabling tcp accounting a user accepts the price. If they then find the performance degradation unacceptable, they can always restart their workload with tcp accounting disabled. There does not seem to be any need to flip it while the workload is running.

Besides, it contradicts how the kmem accounting API works: writing whatever to memory.kmem.limit_in_bytes enables kmem accounting for the cgroup in question, after which it cannot be disabled. Therefore one might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just enables socket accounting w/o limiting it, which might be useful by itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop it. This will allow us to simplify the code by dropping cg_proto->flags.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 January 2016 (2 commits)
-
-
Submitted by Michal Kubeček

Commit acf8dd0a ("udp: only allow UFO for packets from SOCK_DGRAM sockets") disallows UFO for packets sent from raw sockets. We need to do the same also for SOCK_DGRAM sockets with the SO_NO_CHECK option, albeit for a slightly different reason: while such a socket would override the CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and the bad-offloading-flags warning is triggered in __skb_gso_segment().

In the IPv6 case, the SO_NO_CHECK option is ignored, but we need to disallow UFO for packets sent by sockets with the UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
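A hedged sketch of the IPv4 side of the guard (condition shape approximate; the real test lives in __ip_append_data() in net/ipv4/ip_output.c):

    /* consider UFO only for UDP sockets that keep tx checksums on */
    if (transhdrlen &&
        length + fragheaderlen <= mtu &&
        rt->dst.dev->features & NETIF_F_UFO &&
        sk->sk_type == SOCK_DGRAM && !sk->sk_no_check_tx) {
        err = ip_ufo_append_data(sk, queue, getfrag, from, length,
                                 hh_len, fragheaderlen, transhdrlen,
                                 maxfraglen, flags);
        /* ... */
    }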
-
Submitted by Neal Cardwell

For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno and CUBIC, per RFC 5681 (equation 4).

tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh value if the intended reduction was as big as, or bigger than, the current cwnd. Congestion control modules should never return a zero or negative ssthresh. A zero ssthresh generally results in a zero cwnd, causing the connection to stall. A negative ssthresh value will be interpreted as a u32 and will set a target cwnd for PRR near 4 billion.

Oleksandr Natalenko reported that a system using tcp_yeah with ECN could see a warning about a prior_cwnd of 0 in tcp_cwnd_reduction(). Testing verified that this was due to tcp_yeah_ssthresh() misbehaving in this way.

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
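A hedged sketch of the clamp this implies at the end of tcp_yeah_ssthresh() in net/ipv4/tcp_yeah.c (exact expression approximate):

    /* floor at 2 segments, per RFC 5681 eq. (4); computing in a signed
     * type first also catches reduction >= snd_cwnd, which would wrap
     * negative through u32 arithmetic */
    return max_t(int, tp->snd_cwnd - reduction, 2);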
-
- 11 January 2016 (4 commits)
-
-
Submitted by Nikolay Borisov
This is the final part required to namespaceify the tcp keepalive mechanism.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Nikolay Borisov
This is required to have full tcp keepalive mechanism namespace support.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Nikolay Borisov

Different net namespaces might have different requirements as to the keepalive time of tcp sockets. This might be required in cases where different firewall rules are in place, which require tcp keepalive timeouts to be increased/decreased independently of the host.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
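A hedged sketch of the namespacing pattern used across this series (field and helper names follow the tcp keepalive code, but treat details as approximate):

    /* the knob moves from a global sysctl into struct netns_ipv4 ... */
    struct netns_ipv4 {
        /* ... existing fields ... */
        int sysctl_tcp_keepalive_time;
    };

    /* ... and readers resolve it through the socket's own namespace */
    static inline int keepalive_time_when(const struct tcp_sock *tp)
    {
        struct net *net = sock_net((struct sock *)tp);

        return tp->keepalive_time ? : net->ipv4.sysctl_tcp_keepalive_time;
    }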
-
Submitted by Hannes Frederic Sowa

udp tunnel offloads tend to aggregate datagrams based on inner headers. The gro engine gets notified by tunnel implementations about possible offloads, and the match is based solely on the port number. Imagine a tunnel bound to port 53: the offloading will look into all DNS packets and try to aggregate them based on the inner data found within. This could lead to data corruption and malformed DNS packets.

While this patch minimizes the problem and helps an administrator to find the issue by querying ip tunnel/fou, a better way would be to match on the specific destination ip address, so that if a user space socket is bound to the same address it will conflict.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 09 January 2016 (2 commits)
-
-
Submitted by Lance Richardson
Add lock release/acquire annotations to ping_seq_start() and ping_seq_stop() to satisfy sparse.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
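A hedged sketch of what such sparse annotations look like on the ping proc handlers (lock name from net/ipv4/ping.c; treat as approximate):

    static void *ping_seq_start(struct seq_file *seq, loff_t *pos)
        __acquires(ping_table.lock)
    {
        /* ... take the lock, return the first entry ... */
    }

    static void ping_seq_stop(struct seq_file *seq, void *v)
        __releases(ping_table.lock)
    {
        /* ... drop the lock ... */
    }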
-
Submitted by Lance Richardson
Eliminate endianness mismatch warnings (reported by sparse) in this file by using appropriate nla_put_*()/nla_get_*() calls.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
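A hedged illustration of the pattern (the attribute and variable here are hypothetical): a __be32 value should go through the typed big-endian netlink helpers rather than the host-order ones, so sparse can track the byte order end to end:

    __be32 addr = /* network-order address */;

    /* nla_put_u32(skb, RTA_DST, addr) would make sparse warn */
    if (nla_put_be32(skb, RTA_DST, addr))
        goto nla_put_failure;

    addr = nla_get_be32(tb[RTA_DST]);   /* not nla_get_u32() */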
-
- 07 January 2016 (1 commit)
-
-
Submitted by Yuchung Cheng

Patch 3759824d ("tcp: PRR uses CRB mode by default and SS mode conditionally") introduced a bug: cwnd may become 0 when both inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead to a div-by-zero if the connection starts another cwnd reduction phase by setting tp->prior_cwnd to the current cwnd (0) in tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or sacked. Then cwnd must be positive in all cases as long as ssthresh is positive:

1) the proportional reduction mode: inflight > ssthresh > 0
2) the reduction bound mode:
   a) inflight == ssthresh > 0
   b) inflight < ssthresh: sndcnt > 0, since newly_acked_sacked > 0 and inflight < ssthresh

Therefore inflight and sndcnt cannot both be 0. We check for an invalid tp->prior_cwnd to avoid potential div-by-zero bugs.

In reality this bug is triggered only by a sequence of less common events. For example, the connection is terminating an ECN-triggered cwnd reduction with an inflight of 0, then it receives reordered/old ACKs or DSACKs from a prior transmission (which acks nothing). Or the connection is in a fast recovery stage that marks everything lost, but fails to retransmit due to local issues, then receives data packets from the other end which ack nothing.

Fixes: 3759824d ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
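A hedged sketch of the early return this describes at the top of tcp_cwnd_reduction() (parameter list abbreviated relative to the real function):

    static void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked,
                                   int flag)
    {
        struct tcp_sock *tp = tcp_sk(sk);

        /* nothing acked or sacked: skip PRR entirely, and never
         * divide by a zero prior_cwnd below */
        if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
            return;
        /* ... PRR arithmetic using tp->prior_cwnd as divisor ... */
    }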
-
- 06 January 2016 (2 commits)
-
-
Submitted by Craig Gallek

This socket-lookup path did not pass along the skb in question in my original BPF-based socket selection patch. The skb in the udpN_lib_lookup2 path can be used for BPF-based socket selection just like it is in the 'traditional' udpN_lib_lookup path.

udpN_lib_lookup2 kicks in when there are more than 10 sockets in the same hlist slot. Coincidentally, I chose 10 sockets per reuseport group in my functional test, so the lookup2 path was not exercised. This adds an additional set of tests with 20 sockets.

Fixes: 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Fixes: 3ca8e402 ("soreuseport: BPF selection functional test")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Florian Westphal
The only user was removed in commit 029f7f3b ("netfilter: ipv6: nf_defrag: avoid/free clone operations").

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 05 January 2016 (4 commits)
-
-
Submitted by David Ahern

Commands run in a vrf context are not failing as expected on a route lookup:

root@kenny:~# ip ro ls table vrf-red
unreachable default
root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
ping: Warning: source address might be selected on device other than vrf-red.
PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

--- 10.100.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Since the vrf table does not have a route for 10.100.1.254, the ping should have failed. The saddr lookup causes a full VRF table lookup. Propagating a lookup failure to the user allows the command to fail as expected:

root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
connect: No route to host

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Craig Gallek
Expose socket options for setting a classic or extended BPF program for use when selecting sockets in an SO_REUSEPORT group. These options can be used on the first socket to belong to a group before bind, or on any socket in the group after bind.

This change includes refactoring of the existing sk_filter code to allow reuse of the existing BPF filter validation checks.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
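A hedged userspace sketch of attaching a classic BPF program with the new option. SO_ATTACH_REUSEPORT_CBPF is the option named by this commit; the trivial filter below, which returns the current CPU number so sockets pair with CPUs, is a common demonstration and an assumption of mine, not taken from the commit itself:

    #include <linux/filter.h>
    #include <sys/socket.h>

    #ifndef SO_ATTACH_REUSEPORT_CBPF
    #define SO_ATTACH_REUSEPORT_CBPF 51   /* asm-generic value; older libcs lack it */
    #endif

    static int attach_reuseport_cbpf(int fd)
    {
        /* "return CPU id": socket index in the group = current CPU */
        struct sock_filter code[] = {
            { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
            { BPF_RET | BPF_A, 0, 0, 0 },
        };
        struct sock_fprog prog = {
            .len    = sizeof(code) / sizeof(code[0]),
            .filter = code,
        };

        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
    }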
-
Submitted by Craig Gallek
Include a struct sock_reuseport instance when a UDP socket binds to a specific address for the first time with the reuseport flag set. When selecting a socket for an incoming UDP packet, use the information available in sock_reuseport if present.

This required adding an additional field to the UDP source address equality function to differentiate between exact and wildcard matches. The original use case allowed wildcard matches when checking for existing port uses during bind. The new use case of adding a socket to a reuseport group requires exact address matching.

Performance test (using a machine with 2 CPU sockets and a total of 48 cores): create reuseport groups of varying size, use one socket from this group per user thread (pinning each thread to a different core) calling recvmmsg in a tight loop, and record the number of messages received per second while saturating a 10G link.

10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)

This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet

Backport of this upstream commit into stable kernels:

89c22d8c ("net: Fix skb csum races when peeking")

exposed a bug in the udp stack vs MSG_PEEK support, when the user provides a buffer smaller than the skb payload. In this case,

skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);

returns -EFAULT.

This bug does not happen in upstream kernels, since Al Viro did a great job replacing this with:

skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);

This variant is safe vs short buffers.

For the time being, instead of reverting Herbert Xu's patch and adding back the invalid skb->ip_summed changes, simply store the result of udp_lib_checksum_complete() so that we avoid computing the checksum a second time, and avoid the problematic skb_copy_and_csum_datagram_iovec() call.

This patch can be applied on recent kernels, as it avoids a double checksumming, then backported to stable kernels as a bug fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 29 December 2015 (1 commit)
-
-
Submitted by Pablo Neira Ayuso

We have to release the existing objects on netns removal, otherwise we leak them. Chains are unregistered first to make sure no packets are walking on our rules and sets anymore. The object release happens when we unregister the family via nft_release_afinfo(), which is called from nft_unregister_afinfo() from the corresponding __net_exit path in every family.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 26 December 2015 (1 commit)
-
-
Submitted by Pravin B Shelar

By moving the stats update into iptunnel_xmit(), we can simplify iptunnel_xmit() usage. With this change there is no need to call another function (iptunnel_xmit_stats()) to update stats in the tunnel xmit code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 23 December 2015 (3 commits)
-
-
Submitted by Florian Westphal

Hannes points out that when we generate a tcp reset for timewait sockets we pretend we found no socket and pass NULL sk to tcp_vX_send_reset(). Make it cope with inet tw sockets and then provide the tw sk. This makes RSTs appear on the correct interface when SO_BINDTODEVICE is used.

Packetdrill test case:

// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0

0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1

0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46

// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46

// fails without this change -- default route is used
1.301 > R 1:1(0) win 0

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Florian Westphal

tcp_md5_do_lookup() requires a full socket, so once we extend _send_reset() to also accept timewait sockets we would have to change

if (!sk && hash_location)

to something like

if ((!sk || !sk_fullsock(sk)) && hash_location) {
	...
} else {
	/* sk && sk_fullsock(sk) */
	tcp_md5_do_lookup()
}

Switch the two branches: check if we have a socket first, then fall back to a listener lookup if we saw an md5 option (hash_location).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Lorenzo Colitti
When closing a listen socket, tcp_abort currently calls tcp_done without clearing the request queue. If the socket has a child socket that is established but not yet accepted, the child socket is then left without a parent, causing a leak. Fix this by setting the socket state to TCP_CLOSE and calling inet_csk_listen_stop with the socket lock held, like tcp_close does.

Tested using net_test. With this patch, calling SOCK_DESTROY on a listen socket that has an established but not yet accepted child socket results in the parent and the child being closed, such that they no longer appear in sock_diag dumps.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 19 December 2015 (3 commits)
-
-
Submitted by Pravin B Shelar
The IP-TTL case is already handled in the ip_tunnel_ioctl() API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
Adding support for SYN_RECV request sockets to tcp_abort() is quite easy after our tcp listener rewrite. Note that we also need to better handle listeners, or we might leak not-yet-accepted children, because of a missing inet_csk_listen_stop() call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by David Ahern

Allow accepted sockets to derive their sk_bound_dev_if setting from the l3mdev domain in which the packets originated. A sysctl setting is added to control the behavior, which is similar to sk_mark and sysctl_tcp_fwmark_accept.

This effectively allows a process to have a "VRF-global" listen socket, with child sockets bound to the VRF device in which the packet originated. A similar behavior can be achieved using sk_mark, but a solution using marks is incomplete, as it does not handle duplicate addresses in different L3 domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from the l3mdev domain provides a complete solution.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
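A schematic sketch of the inheritance point described, under the assumption that the new knob lives in struct netns_ipv4 and that the l3mdev master is resolved from the ingress interface (names as I recall them from the l3mdev work; placement and shape approximate):

    /* when the listener itself is unbound, bind the child to the VRF
     * (l3mdev master) of the interface the SYN arrived on */
    if (!sk->sk_bound_dev_if && net->ipv4.sysctl_tcp_l3mdev_accept)
        sk->sk_bound_dev_if =
            l3mdev_master_ifindex_by_index(net, skb->skb_iif);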
-
- 18 December 2015 (1 commit)
-
-
Submitted by Eric Dumazet

Yuchung tracked a regression caused by commit 57be5bda ("ip: convert tcp_sendmsg() to iov_iter primitives") for TCP Fast Open. Some Fast Open users do not actually add any data in the SYN packet.

Fixes: 57be5bda ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 17 December 2015 (1 commit)
-
-
Submitted by Hannes Frederic Sowa

fou->udp_offloads is managed by RCU. As it is actually included inside the fou sockets, we cannot let the memory go out of scope before a grace period. We can either synchronize_rcu() or switch over to kfree_rcu() to manage the sockets. kfree_rcu() seems appropriate, as it is used by vxlan and geneve.

Fixes: 23461551 ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
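A hedged sketch of the kfree_rcu() conversion described: give struct fou an rcu_head and free it only after a grace period, instead of kfree()ing it immediately (field placement approximate):

    struct fou {
        struct socket *sock;
        /* ... other fields, including udp_offloads ... */
        struct rcu_head rcu;
    };

    static void fou_release(struct fou *fou)
    {
        /* ... unregister offloads, release the socket ... */
        kfree_rcu(fou, rcu);    /* was: kfree(fou) */
    }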
-