- 12 12月, 2013 2 次提交
-
-
由 Eric Dumazet 提交于
Unlike TCP, UDP input path does not hold the socket lock. Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise multiple cpus could leak a refcount. This patch also takes care of renewing a stale dst entry. (When the sk->sk_rx_dst would not be used by IP early demux) Fixes: 421b3885 ("udp: ipv4: Add udp early demux") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
pskb_may_pull() can reallocate skb->head, we need to move the initialization of iph and uh pointers after its call. Fixes: 421b3885 ("udp: ipv4: Add udp early demux") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 12月, 2013 2 次提交
-
-
由 Eric Dumazet 提交于
Dave Jones reported a use after free in UDP stack : [ 5059.434216] ========================= [ 5059.434314] [ BUG: held lock freed! ] [ 5059.434420] 3.13.0-rc3+ #9 Not tainted [ 5059.434520] ------------------------- [ 5059.434620] named/863 is freeing memory ffff88005e960000-ffff88005e96061f, with a lock still held there! [ 5059.434815] (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0 [ 5059.435012] 3 locks held by named/863: [ 5059.435086] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff8143054d>] __netif_receive_skb_core+0x11d/0x940 [ 5059.435295] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff81467a5e>] ip_local_deliver_finish+0x3e/0x410 [ 5059.435500] #2: (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0 [ 5059.435734] stack backtrace: [ 5059.435858] CPU: 0 PID: 863 Comm: named Not tainted 3.13.0-rc3+ #9 [loadavg: 0.21 0.06 0.06 1/115 1365] [ 5059.436052] Hardware name: /D510MO, BIOS MOPNV10J.86A.0175.2010.0308.0620 03/08/2010 [ 5059.436223] 0000000000000002 ffff88007e203ad8 ffffffff8153a372 ffff8800677130e0 [ 5059.436390] ffff88007e203b10 ffffffff8108cafa ffff88005e960000 ffff88007b00cfc0 [ 5059.436554] ffffea00017a5800 ffffffff8141c490 0000000000000246 ffff88007e203b48 [ 5059.436718] Call Trace: [ 5059.436769] <IRQ> [<ffffffff8153a372>] dump_stack+0x4d/0x66 [ 5059.436904] [<ffffffff8108cafa>] debug_check_no_locks_freed+0x15a/0x160 [ 5059.437037] [<ffffffff8141c490>] ? __sk_free+0x110/0x230 [ 5059.437147] [<ffffffff8112da2a>] kmem_cache_free+0x6a/0x150 [ 5059.437260] [<ffffffff8141c490>] __sk_free+0x110/0x230 [ 5059.437364] [<ffffffff8141c5c9>] sk_free+0x19/0x20 [ 5059.437463] [<ffffffff8141cb25>] sock_edemux+0x25/0x40 [ 5059.437567] [<ffffffff8141c181>] sock_queue_rcv_skb+0x81/0x280 [ 5059.437685] [<ffffffff8149bd21>] ? udp_queue_rcv_skb+0xd1/0x4b0 [ 5059.437805] [<ffffffff81499c82>] __udp_queue_rcv_skb+0x42/0x240 [ 5059.437925] [<ffffffff81541d25>] ? _raw_spin_lock+0x65/0x70 [ 5059.438038] [<ffffffff8149bebb>] udp_queue_rcv_skb+0x26b/0x4b0 [ 5059.438155] [<ffffffff8149c712>] __udp4_lib_rcv+0x152/0xb00 [ 5059.438269] [<ffffffff8149d7f5>] udp_rcv+0x15/0x20 [ 5059.438367] [<ffffffff81467b2f>] ip_local_deliver_finish+0x10f/0x410 [ 5059.438492] [<ffffffff81467a5e>] ? ip_local_deliver_finish+0x3e/0x410 [ 5059.438621] [<ffffffff81468653>] ip_local_deliver+0x43/0x80 [ 5059.438733] [<ffffffff81467f70>] ip_rcv_finish+0x140/0x5a0 [ 5059.438843] [<ffffffff81468926>] ip_rcv+0x296/0x3f0 [ 5059.438945] [<ffffffff81430b72>] __netif_receive_skb_core+0x742/0x940 [ 5059.439074] [<ffffffff8143054d>] ? __netif_receive_skb_core+0x11d/0x940 [ 5059.442231] [<ffffffff8108c81d>] ? trace_hardirqs_on+0xd/0x10 [ 5059.442231] [<ffffffff81430d83>] __netif_receive_skb+0x13/0x60 [ 5059.442231] [<ffffffff81431c1e>] netif_receive_skb+0x1e/0x1f0 [ 5059.442231] [<ffffffff814334e0>] napi_gro_receive+0x70/0xa0 [ 5059.442231] [<ffffffffa01de426>] rtl8169_poll+0x166/0x700 [r8169] [ 5059.442231] [<ffffffff81432bc9>] net_rx_action+0x129/0x1e0 [ 5059.442231] [<ffffffff810478cd>] __do_softirq+0xed/0x240 [ 5059.442231] [<ffffffff81047e25>] irq_exit+0x125/0x140 [ 5059.442231] [<ffffffff81004241>] do_IRQ+0x51/0xc0 [ 5059.442231] [<ffffffff81542bef>] common_interrupt+0x6f/0x6f We need to keep a reference on the socket, by using skb_steal_sock() at the right place. Note that another patch is needed to fix a race in udp_sk_rx_dst_set(), as we hold no lock protecting the dst. Fixes: 421b3885 ("udp: ipv4: Add udp early demux") Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefan Tomanek 提交于
This changes ensures that the routing entry investigated by the suppress function actually does point to a device struct before following that pointer, fixing a possible kernel oops situation when verifying the interface group associated with a routing table entry. According to Daniel Golle, this Oops can be triggered by a user process trying to establish an outgoing IPv6 connection while having no real IPv6 connectivity set up (only autoassigned link-local addresses). Fixes: 6ef94cfa ("fib_rules: add route suppression based on ifgroup") Reported-by: NDaniel Golle <daniel.golle@gmail.com> Tested-by: NDaniel Golle <daniel.golle@gmail.com> Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2013 1 次提交
-
-
由 Eric W. Biederman 提交于
kill memcg_tcp_enter_memory_pressure. The only function of memcg_tcp_enter_memory_pressure was to reduce deal with the unnecessary abstraction that was tcp_memcontrol. Now that struct tcp_memcontrol is gone remove this unnecessary function, the unnecessary function pointer, and modify sk_enter_memory_pressure to set this field directly, just as sk_leave_memory_pressure cleas this field directly. This fixes a small bug I intruduced when killing struct tcp_memcontrol that caused memcg_tcp_enter_memory_pressure to never be called and thus failed to ever set cg_proto->memory_pressure. Remove the cg_proto enter_memory_pressure function as it now serves no useful purpose. Don't test cg_proto->memory_presser in sk_leave_memory_pressure before clearing it. The test was originally there to ensure that the pointer was non-NULL. Now that cg_proto is not a pointer the pointer does not matter. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 11月, 2013 2 次提交
-
-
由 Eric Dumazet 提交于
In commit c9e90429 ("ipv4: fix possible seqlock deadlock") I left another places where IP_INC_STATS_BH() were improperly used. udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from process context, not from softirq context. This was detected by lockdep seqlock support. Reported-by: Njongman heo <jongman.heo@samsung.com> Fixes: 584bdf8c ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Shawn Landden 提交于
Commit 35f9c09f (tcp: tcp_sendpages() should call tcp_push() once) added an internal flag MSG_SENDPAGE_NOTLAST, similar to MSG_MORE. algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages() and need to see the new flag as identical to MSG_MORE. This fixes sendfile() on AF_ALG. v3: also fix udp Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x Reported-and-tested-by: NShawn Landden <shawnlandden@gmail.com> Original-patch: Richard Weinberger <richard@nod.at> Signed-off-by: NShawn Landden <shawn@churchofgit.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 11月, 2013 1 次提交
-
-
由 Baker Zhang 提交于
since f9242b6b inet: Sanitize inet{,6} protocol demux. there are not pretended hash tables for ipv4 or ipv6 protocol handler. Signed-off-by: NBaker Zhang <Baker.kernel@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 11月, 2013 4 次提交
-
-
由 Hannes Frederic Sowa 提交于
Commit bceaa902 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") conditionally updated addr_len if the msg_name is written to. The recv_error and rxpmtu functions relied on the recvmsg functions to set up addr_len before. As this does not happen any more we have to pass addr_len to those functions as well and set it to the size of the corresponding sockaddr length. This broke traceroute and such. Fixes: bceaa902 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") Reported-by: NBrad Spengler <spender@grsecurity.net> Reported-by: Tom Labanowski Cc: mpb <mpb.mail@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Gao feng 提交于
nobody needs it. remove. Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Herbert Xu 提交于
This patch simplifies the checksum verification in tcpX_gro_receive by reusing the CHECKSUM_COMPLETE code for CHECKSUM_NONE. All it does for CHECKSUM_NONE is compute the partial checksum and then treat it as if it came from the hardware (CHECKSUM_COMPLETE). Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Cheers, Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Herbert Xu 提交于
In some cases we may receive IP packets that are longer than their stated lengths. Such packets are never merged in GRO. However, we may end up computing their checksums incorrectly and end up allowing packets with a bogus checksum enter our stack with the checksum status set as verified. Since such packets are rare and not performance-critical, this patch simply skips the checksum verification for them. Reported-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com> Thanks, Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 11月, 2013 1 次提交
-
-
由 Alexei Starovoitov 提交于
CPUs can ask for local route via ip_route_input_noref() concurrently. if nh_rth_input is not cached yet, CPUs will proceed to allocate equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input via rt_cache_route() Most of the time they succeed, but on occasion the following two lines: orig = *p; prev = cmpxchg(p, orig, rt); in rt_cache_route() do race and one of the cpus fails to complete cmpxchg. But ip_route_input_slow() doesn't check the return code of rt_cache_route(), so dst is leaking. dst_destroy() is never called and 'lo' device refcnt doesn't go to zero, which can be seen in the logs as: unregister_netdevice: waiting for lo to become free. Usage count = 1 Adding mdelay() between above two lines makes it easily reproducible. Fix it similar to nh_pcpu_rth_output case. Fixes: d2d68ba9 ("ipv4: Cache input routes in fib_info nexthops.") Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2013 3 次提交
-
-
由 Johannes Berg 提交于
As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andrey Vagin 提交于
snd_nxt must be updated synchronously with sk_send_head. Otherwise tp->packets_out may be updated incorrectly, what may bring a kernel panic. Here is a kernel panic from my host. [ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150 ... [ 146.301158] Call Trace: [ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0 Before this panic a tcp socket was restored. This socket had sent and unsent data in the write queue. Sent data was restored in repair mode, then the socket was switched from reapair mode and unsent data was restored. After that the socket was switched back into repair mode. In that moment we had a socket where write queue looks like this: snd_una snd_nxt write_seq |_________|________| | sk_send_head After a second switching from repair mode the state of socket was changed: snd_una snd_nxt, write_seq |_________ ________| | sk_send_head This state is inconsistent, because snd_nxt and sk_send_head are not synchronized. Bellow you can find a call trace, how packets_out can be incremented twice for one skb, if snd_nxt and sk_send_head are not synchronized. In this case packets_out will be always positive, even when sk_write_queue is empty. tcp_write_wakeup skb = tcp_send_head(sk); tcp_fragment if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq)) tcp_adjust_pcount(sk, skb, diff); tcp_event_new_data_sent tp->packets_out += tcp_skb_pcount(skb); I think update of snd_nxt isn't required, when a socket is switched from repair mode. Because it's initialized in tcp_connect_init. Then when a write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent, so it's always is in consistent state. I have checked, that the bug is not reproduced with this patch and all tests about restoring tcp connections work fine. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: NAndrey Vagin <avagin@openvz.org> Acked-by: NPavel Emelyanov <xemul@parallels.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 fan.du 提交于
After searching rt by the vti tunnel dst/src parameter, if this rt has neither attached to any transformation nor the transformation is not tunnel oriented, this rt should be released back to ip layer. otherwise causing dst memory leakage. Signed-off-by: NFan Du <fan.du@windriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 11月, 2013 2 次提交
-
-
由 Hannes Frederic Sowa 提交于
A plain read() on a socket does set msg->msg_name to NULL. So check for NULL pointer first. Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hannes Frederic Sowa 提交于
Only update *addr_len when we actually fill in sockaddr, otherwise we can return uninitialized memory from the stack to the caller in the recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL) checks because we only get called with a valid addr_len pointer either from sock_common_recvmsg or inet_recvmsg. If a blocking read waits on a socket which is concurrently shut down we now return zero and set msg_msgnamelen to 0. Reported-by: Nmpb <mpb.mail@gmail.com> Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 11月, 2013 1 次提交
-
-
由 Martin Topholm 提交于
When the synproxy_parse_options is called on the client ack the mss option will not be present. Consequently mss wont be included in the backend syn packet, which falls back to 536 bytes mss. Therefore XT_SYNPROXY_OPT_MSS is explicitly flagged when recovering mss value from cookie. Signed-off-by: NMartin Topholm <mph@one.com> Reviewed-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 15 11月, 2013 5 次提交
-
-
由 Tetsuo Handa 提交于
All seq_printf() users are using "%n" for calculating padding size, convert them to use seq_setwidth() / seq_pad() pair. Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NKees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric Dumazet 提交于
ip4_datagram_connect() being called from process context, it should use IP_INC_STATS() instead of IP_INC_STATS_BH() otherwise we can deadlock on 32bit arches, or get corruptions of SNMP counters. Fixes: 584bdf8c ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Johannes Berg 提交于
Now that genl_ops are no longer modified in place when registering, they can be made const. This patch was done mostly with spatch: @@ identifier ops; @@ +const struct genl_ops ops[] = { ... }; (except the struct thing in net/openvswitch/datapath.c) Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We had some reports of crashes using TCP fastopen, and Dave Jones gave a nice stack trace pointing to the error. Issue is that tcp_get_metrics() should not be called with a NULL dst Fixes: 1fe4c481 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDave Jones <davej@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Tested-by: NDave Jones <davej@fedoraproject.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
After commit c9eeec26 ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26 ("tcp: TSQ can use a dynamic limit") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NSujith Manoharan <sujith@msujith.org> Reported-by: NArnaud Ebalard <arno@natisbad.org> Tested-by: NSujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 11月, 2013 1 次提交
-
-
由 Alexei Starovoitov 提交于
commit 06a23fe3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f23 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 11月, 2013 1 次提交
-
-
由 Eric Dumazet 提交于
While testing virtio_net and skb_segment() changes, Hannes reported that UFO was sending wrong frames. It appears this was introduced by a recent commit : 8c3a897b ("inet: restore gso for vxlan") The old condition to perform IP frag was : tunnel = !!skb->encapsulation; ... if (!tunnel && proto == IPPROTO_UDP) { So the new one should be : udpfrag = !skb->encapsulation && proto == IPPROTO_UDP; ... if (udpfrag) { Initialization of udpfrag must be done before call to ops->callbacks.gso_segment(skb, features), as skb_udp_tunnel_segment() clears skb->encapsulation (We want udpfrag to be true for UFO, false for VXLAN) With help from Alexei Starovoitov Reported-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 11月, 2013 2 次提交
-
-
由 John Stultz 提交于
In order to enable lockdep on seqcount/seqlock structures, we must explicitly initialize any locks. The u64_stats_sync structure, uses a seqcount, and thus we need to introduce a u64_stats_init() function and use it to initialize the structure. This unfortunately adds a lot of fairly trivial initialization code to a number of drivers. But the benefit of ensuring correctness makes this worth while. Because these changes are required for lockdep to be enabled, and the changes are quite trivial, I've not yet split this patch out into 30-some separate patches, as I figured it would be better to get the various maintainers thoughts on how to best merge this change along with the seqcount lockdep enablement. Feedback would be appreciated! Signed-off-by: NJohn Stultz <john.stultz@linaro.org> Acked-by: NJulian Anastasov <ja@ssi.bg> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Jesse Gross <jesse@nicira.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Roger Luethi <rl@hellgate.ch> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Simon Horman <horms@verge.net.au> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Wensong Zhang <wensong@linux-vs.org> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Hannes Frederic Sowa 提交于
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: NFlorian Weimer <fweimer@redhat.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 11月, 2013 2 次提交
-
-
由 Yuchung Cheng 提交于
Slow start now increases cwnd by 1 if an ACK acknowledges some packets, regardless the number of packets. Consequently slow start performance is highly dependent on the degree of the stretch ACKs caused by receiver or network ACK compression mechanisms (e.g., delayed-ACK, GRO, etc). But slow start algorithm is to send twice the amount of packets of packets left so it should process a stretch ACK of degree N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A follow up patch will use the remainder of the N (if greater than 1) to adjust cwnd in the congestion avoidance phase. In addition this patch retires the experimental limited slow start (LSS) feature. LSS has multiple drawbacks but questionable benefit. The fractional cwnd increase in LSS requires a loop in slow start even though it's rarely used. Configuring such an increase step via a global sysctl on different BDPS seems hard. Finally and most importantly the slow start overshoot concern is now better covered by the Hybrid slow start (hystart) enabled by default. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Applications have started to use Fast Open (e.g., Chrome browser has such an optional flag) and the feature has gone through several generations of kernels since 3.7 with many real network tests. It's time to enable this flag by default for applications to test more conveniently and extensively. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 11月, 2013 1 次提交
-
-
由 Wei Yongjun 提交于
Remove duplicated include. Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 01 11月, 2013 1 次提交
-
-
由 Steffen Klassert 提交于
On some codepaths the skb does not have a dst entry when xfrm_decode_session() is called. So check for a valid skb_dst() before dereferencing the device interface index. We use 0 as the device index if there is no valid skb_dst(), or at reverse decoding we use skb_iif as device interface index. Bug was introduced with git commit bafd4bd4 ("xfrm: Decode sessions with output interface."). Reported-by: NMeelis Roos <mroos@linux.ee> Tested-by: NMeelis Roos <mroos@linux.ee> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
- 30 10月, 2013 1 次提交
-
-
由 Yuchung Cheng 提交于
Fast Open currently has a fall back feature to address SYN-data being dropped but it requires the middle-box to pass on regular SYN retry after SYN-data. This is implemented in commit aab48743 ("net-tcp: Fast Open client - detecting SYN-data drops") However some NAT boxes will drop all subsequent packets after first SYN-data and blackholes the entire connections. An example is in commit 356d7d88 "netfilter: nf_conntrack: fix tcp_in_window for Fast Open". The sender should note such incidents and fall back to use the regular TCP handshake on subsequent attempts temporarily as well: after the second SYN timeouts the original Fast Open SYN is most likely lost. When such an event recurs Fast Open is disabled based on the number of recurrences exponentially. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 10月, 2013 4 次提交
-
-
由 Mathias Krause 提交于
struct esp_data consists of a single pointer, vanishing the need for it to be a structure. Fold the pointer into 'data' direcly, removing one level of pointer indirection. Signed-off-by: NMathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Mathias Krause 提交于
The padlen member of struct esp_data is always zero. Get rid of it. Signed-off-by: NMathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Hannes Frederic Sowa 提交于
UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and IP_PMTUDISC_PROBE well enough. UFO enabled packet delivery just appends all frags to the cork and hands it over to the network card. So we just deliver non-DF udp fragments (DF-flag may get overwritten by hardware or virtual UFO enabled interface). UDP_CORK does enqueue the data until the cork is disengaged. At this point it sets the correct IP_DF and local_df flags and hands it over to ip_fragment which in this case will generate an icmp error which gets appended to the error socket queue. This is not reflected in the syscall error (of course, if UFO is enabled this also won't happen). Improve this by checking the pmtudisc flags before appending data to the socket and if we still can fit all data in one packet when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, only then proceed. We use (mtu-fragheaderlen) to check for the maximum length because we ensure not to generate a fragment and non-fragmented data does not need to have its length aligned on 64 bit boundaries. Also the passed in ip_options are already aligned correctly. Maybe, we can relax some other checks around ip_fragment. This needs more research. Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
commit 6ff50cd5 ("tcp: gso: do not generate out of order packets") had an heuristic that can trigger a warning in skb_try_coalesce(), because skb->truesize of the gso segments were exactly set to mss. This breaks the requirement that skb->truesize >= skb->len + truesizeof(struct sk_buff); It can trivially be reproduced by : ifconfig lo mtu 1500 ethtool -K lo tso off netperf As the skbs are looped into the TCP networking stack, skb_try_coalesce() warns us of these skb under-estimating their truesize. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 10月, 2013 3 次提交
-
-
由 Steffen Klassert 提交于
With the removal of the routing cache, we lost the option to tweak the garbage collector threshold along with the maximum routing cache size. So git commit 703fb94e ("xfrm: Fix the gc threshold value for ipv4") moved back to a static threshold. It turned out that the current threshold before we start garbage collecting is much to small for some workloads, so increase it from 1024 to 32768. This means that we start the garbage collector if we have more than 32768 dst entries in the system and refuse new allocations if we are above 65536. Reported-by: NWolfgang Walter <linux@stwm.de> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Eric Dumazet 提交于
Alexei reported a performance regression on vxlan, caused by commit 3347c960 "ipv4: gso: make inet_gso_segment() stackable" GSO vxlan packets were not properly segmented, adding IP fragments while they were not expected. Rename 'bool tunnel' to 'bool encap', and add a new boolean to express the fact that UDP should be fragmented. This fragmentation is triggered by skb->encapsulation being set. Remove a "skb->encapsulation = 1" added in above commit, as its not needed, as frags inherit skb->frag from original GSO skb. Reported-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Tested-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Patch ed08495c "tcp: use RTT from SACK for RTO" always re-arms RTO upon obtaining a RTT sample from newly sacked data. But technically RTO should only be re-armed when the data sent before the last (re)transmission of write queue head are (s)acked. Otherwise the RTO may continue to extend during loss recovery on data sent in the future. Note that RTTs from ACK or timestamps do not have this problem, as the RTT source must be from data sent before. The new RTO re-arm policy is 1) Always re-arm RTO if SND.UNA is advanced 2) Re-arm RTO if sack RTT is available, provided the sacked data was sent before the last time write_queue_head was sent. Signed-off-by: NLarry Brakmo <brakmo@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-