提交 · 77f99ad16a07aa062c2d30fae57b1fee456f6ef6 · openeuler / raspberrypi-kernel

18 1月, 2014 1 次提交

tcp: metrics: Avoid duplicate entries with the same destination-IP · 77f99ad1

由 Christoph Paasch 提交于 1月 16, 2014

Because the tcp-metrics is an RCU-list, it may be that two
soft-interrupts are inside __tcp_get_metrics() for the same
destination-IP at the same time. If this destination-IP is not yet part of
the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
a new entry for this IP.
So, we will have two tcp-metrics with the same destination-IP in the list.

This patch checks twice __tcp_get_metrics(). First without holding the
lock, then while holding the lock. The second one is there to confirm
that the entry has not been added by another soft-irq while waiting for
the spin-lock.

Fixes: 51c5d0c4 (tcp: Maintain dynamic metrics in local cache.)
Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

77f99ad1

15 1月, 2014 1 次提交

net: avoid reference counter overflows on fib_rules in multicast forwarding · 95f4a45d

由 Hannes Frederic Sowa 提交于 1月 13, 2014

Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.

This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.

Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.

Fixes: f0ad0860 ("ipv4: ipmr: support multiple tables")
Fixes: d1db275d ("ipv6: ip6mr: support multiple tables")
Reported-by: NBob Falken <NetFestivalHaveFun@gmx.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

95f4a45d

14 1月, 2014 1 次提交

inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets · 70315d22

由 Neal Cardwell 提交于 1月 10, 2014

Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
(not just TIME_WAIT), and for such sockets the tw_substate field holds
the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.

This brings the inet_diag state-matching code in line with the field
it uses to populate idiag_state. This is also analogous to the info
exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
and get_timewait4_sock() exports tw->tw_substate.

Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
state time-wait" would also return sockets in state TCP_FIN_WAIT2.

This is an old bug that predates 05dbc7b5 ("tcp/dccp: remove twchain").
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70315d22

03 1月, 2014 1 次提交

ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC · 7a7ffbab

由 Wei-Chun Chao 提交于 12月 26, 2013

VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).

Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.

(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.
Signed-off-by: NWei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a7ffbab

23 12月, 2013 1 次提交

ipv4: consistent reporting of pmtu data in case of corking · 61e7f09d

由 Hannes Frederic Sowa 提交于 12月 19, 2013

We report different pmtu values back on the first write and on further
writes on an corked socket.

Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.

Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61e7f09d

20 12月, 2013 1 次提交

net: inet_diag: zero out uninitialized idiag_{src,dst} fields · b1aac815

由 Daniel Borkmann 提交于 12月 17, 2013

Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].

In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...

  rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);

... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:

  r->id.idiag_src[0] = inet->inet_rcv_saddr;
  r->id.idiag_dst[0] = inet->inet_daddr;

struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.

So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).

Similarly, fix also other places where we fill that out.
Reported-by: NJakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1aac815

19 12月, 2013 1 次提交

ip_gre: fix msg_name parsing for recvfrom/recvmsg · 0e3da5bb

由 Timo Teräs 提交于 12月 16, 2013

ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.

Bug introduced in commit c5441932 (GRE: Refactor GRE tunneling code.)

Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: NTimo Teräs <timo.teras@iki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e3da5bb

18 12月, 2013 1 次提交

udp: ipv4: do not use sk_dst_lock from softirq context · e47eb5df

由 Eric Dumazet 提交于 12月 15, 2013

Using sk_dst_lock from softirq context is not supported right now.

Instead of adding BH protection everywhere,
udp_sk_rx_dst_set() can instead use xchg(), as suggested
by David.
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Fixes: 97502231 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e47eb5df

12 12月, 2013 3 次提交

netfilter: nft_reject: fix endianness in dump function · a3adadf3

由 Eric Leblond 提交于 12月 12, 2013

The dump function in nft_reject_ipv4 was not converting a u32
field to network order before sending it to userspace, this
needs to happen for consistency with other nf_tables and
nfnetlink subsystems.
Signed-off-by: NEric Leblond <eric@regit.org>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

a3adadf3

udp: ipv4: must add synchronization in udp_sk_rx_dst_set() · 97502231

由 Eric Dumazet 提交于 12月 11, 2013

Unlike TCP, UDP input path does not hold the socket lock.

Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise
multiple cpus could leak a refcount.

This patch also takes care of renewing a stale dst entry.
(When the sk->sk_rx_dst would not be used by IP early demux)

Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97502231

udp: ipv4: fix potential use after free in udp_v4_early_demux() · 610438b7

由 Eric Dumazet 提交于 12月 11, 2013

pskb_may_pull() can reallocate skb->head, we need to move the
initialization of iph and uh pointers after its call.

Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

610438b7

11 12月, 2013 3 次提交

netfilter: SYNPROXY target: restrict to INPUT/FORWARD · f01b3926

由 Patrick McHardy 提交于 12月 08, 2013

Fix a crash in synproxy_send_tcp() when using the SYNPROXY target in the
PREROUTING chain caused by missing routing information.
Reported-by: NNicki P. <xastx@gmx.de>
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f01b3926

udp: ipv4: fix an use after free in __udp4_lib_rcv() · 8afdd99a

由 Eric Dumazet 提交于 12月 10, 2013

Dave Jones reported a use after free in UDP stack :

[ 5059.434216] =========================
[ 5059.434314] [ BUG: held lock freed! ]
[ 5059.434420] 3.13.0-rc3+ #9 Not tainted
[ 5059.434520] -------------------------
[ 5059.434620] named/863 is freeing memory ffff88005e960000-ffff88005e96061f, with a lock still held there!
[ 5059.434815]  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435012] 3 locks held by named/863:
[ 5059.435086]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8143054d>] __netif_receive_skb_core+0x11d/0x940
[ 5059.435295]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81467a5e>] ip_local_deliver_finish+0x3e/0x410
[ 5059.435500]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435734]
stack backtrace:
[ 5059.435858] CPU: 0 PID: 863 Comm: named Not tainted 3.13.0-rc3+ #9 [loadavg: 0.21 0.06 0.06 1/115 1365]
[ 5059.436052] Hardware name:                  /D510MO, BIOS MOPNV10J.86A.0175.2010.0308.0620 03/08/2010
[ 5059.436223]  0000000000000002 ffff88007e203ad8 ffffffff8153a372 ffff8800677130e0
[ 5059.436390]  ffff88007e203b10 ffffffff8108cafa ffff88005e960000 ffff88007b00cfc0
[ 5059.436554]  ffffea00017a5800 ffffffff8141c490 0000000000000246 ffff88007e203b48
[ 5059.436718] Call Trace:
[ 5059.436769]  <IRQ>  [<ffffffff8153a372>] dump_stack+0x4d/0x66
[ 5059.436904]  [<ffffffff8108cafa>] debug_check_no_locks_freed+0x15a/0x160
[ 5059.437037]  [<ffffffff8141c490>] ? __sk_free+0x110/0x230
[ 5059.437147]  [<ffffffff8112da2a>] kmem_cache_free+0x6a/0x150
[ 5059.437260]  [<ffffffff8141c490>] __sk_free+0x110/0x230
[ 5059.437364]  [<ffffffff8141c5c9>] sk_free+0x19/0x20
[ 5059.437463]  [<ffffffff8141cb25>] sock_edemux+0x25/0x40
[ 5059.437567]  [<ffffffff8141c181>] sock_queue_rcv_skb+0x81/0x280
[ 5059.437685]  [<ffffffff8149bd21>] ? udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.437805]  [<ffffffff81499c82>] __udp_queue_rcv_skb+0x42/0x240
[ 5059.437925]  [<ffffffff81541d25>] ? _raw_spin_lock+0x65/0x70
[ 5059.438038]  [<ffffffff8149bebb>] udp_queue_rcv_skb+0x26b/0x4b0
[ 5059.438155]  [<ffffffff8149c712>] __udp4_lib_rcv+0x152/0xb00
[ 5059.438269]  [<ffffffff8149d7f5>] udp_rcv+0x15/0x20
[ 5059.438367]  [<ffffffff81467b2f>] ip_local_deliver_finish+0x10f/0x410
[ 5059.438492]  [<ffffffff81467a5e>] ? ip_local_deliver_finish+0x3e/0x410
[ 5059.438621]  [<ffffffff81468653>] ip_local_deliver+0x43/0x80
[ 5059.438733]  [<ffffffff81467f70>] ip_rcv_finish+0x140/0x5a0
[ 5059.438843]  [<ffffffff81468926>] ip_rcv+0x296/0x3f0
[ 5059.438945]  [<ffffffff81430b72>] __netif_receive_skb_core+0x742/0x940
[ 5059.439074]  [<ffffffff8143054d>] ? __netif_receive_skb_core+0x11d/0x940
[ 5059.442231]  [<ffffffff8108c81d>] ? trace_hardirqs_on+0xd/0x10
[ 5059.442231]  [<ffffffff81430d83>] __netif_receive_skb+0x13/0x60
[ 5059.442231]  [<ffffffff81431c1e>] netif_receive_skb+0x1e/0x1f0
[ 5059.442231]  [<ffffffff814334e0>] napi_gro_receive+0x70/0xa0
[ 5059.442231]  [<ffffffffa01de426>] rtl8169_poll+0x166/0x700 [r8169]
[ 5059.442231]  [<ffffffff81432bc9>] net_rx_action+0x129/0x1e0
[ 5059.442231]  [<ffffffff810478cd>] __do_softirq+0xed/0x240
[ 5059.442231]  [<ffffffff81047e25>] irq_exit+0x125/0x140
[ 5059.442231]  [<ffffffff81004241>] do_IRQ+0x51/0xc0
[ 5059.442231]  [<ffffffff81542bef>] common_interrupt+0x6f/0x6f

We need to keep a reference on the socket, by using skb_steal_sock()
at the right place.

Note that another patch is needed to fix a race in
udp_sk_rx_dst_set(), as we hold no lock protecting the dst.

Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
Reported-by: NDave Jones <davej@redhat.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8afdd99a

inet: fix NULL pointer Oops in fib(6)_rule_suppress · 673498b8

由 Stefan Tomanek 提交于 12月 10, 2013

This changes ensures that the routing entry investigated by the suppress
function actually does point to a device struct before following that pointer,
fixing a possible kernel oops situation when verifying the interface group
associated with a routing table entry.

According to Daniel Golle, this Oops can be triggered by a user process trying
to establish an outgoing IPv6 connection while having no real IPv6 connectivity
set up (only autoassigned link-local addresses).

Fixes: 6ef94cfa ("fib_rules: add route suppression based on ifgroup")
Reported-by: NDaniel Golle <daniel.golle@gmail.com>
Tested-by: NDaniel Golle <daniel.golle@gmail.com>
Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

673498b8

06 12月, 2013 1 次提交

tcp_memcontrol: Cleanup/fix cg_proto->memory_pressure handling. · 7f2cbdc2

由 Eric W. Biederman 提交于 12月 04, 2013

kill memcg_tcp_enter_memory_pressure.  The only function of
memcg_tcp_enter_memory_pressure was to reduce deal with the
unnecessary abstraction that was tcp_memcontrol.  Now that struct
tcp_memcontrol is gone remove this unnecessary function, the
unnecessary function pointer, and modify sk_enter_memory_pressure to
set this field directly, just as sk_leave_memory_pressure cleas this
field directly.

This fixes a small bug I intruduced when killing struct tcp_memcontrol
that caused memcg_tcp_enter_memory_pressure to never be called and
thus failed to ever set cg_proto->memory_pressure.

Remove the cg_proto enter_memory_pressure function as it now serves
no useful purpose.

Don't test cg_proto->memory_presser in sk_leave_memory_pressure before
clearing it.  The test was originally there to ensure that the pointer
was non-NULL.  Now that cg_proto is not a pointer the pointer does not
matter.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7f2cbdc2

30 11月, 2013 2 次提交

inet: fix possible seqlock deadlocks · f1d8cba6

由 Eric Dumazet 提交于 11月 28, 2013

In commit c9e90429 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.

udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.

This was detected by lockdep seqlock support.
Reported-by: Njongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8c ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f1d8cba6

net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLAST · d3f7d56a

由 Shawn Landden 提交于 11月 24, 2013

Commit 35f9c09f (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.

algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.

This fixes sendfile() on AF_ALG.

v3: also fix udp

Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x
Reported-and-tested-by: NShawn Landden <shawnlandden@gmail.com>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: NShawn Landden <shawn@churchofgit.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d3f7d56a

29 11月, 2013 1 次提交

net: remove outdated comment for ipv4 and ipv6 protocol handler · 39af0c40

由 Baker Zhang 提交于 11月 26, 2013

since f9242b6b
inet: Sanitize inet{,6} protocol demux.

there are not pretended hash tables for ipv4 or
ipv6 protocol handler.
Signed-off-by: NBaker Zhang <Baker.kernel@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39af0c40

24 11月, 2013 4 次提交

inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions · 85fbaa75

由 Hannes Frederic Sowa 提交于 11月 23, 2013

Commit bceaa902 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.

This broke traceroute and such.

Fixes: bceaa902 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: NBrad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

85fbaa75

tcp_memcg: remove useless var old_lim · fb10f802

由 Gao feng 提交于 11月 22, 2013

nobody needs it. remove.
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb10f802

gro: Clean up tcpX_gro_receive checksum verification · b8ee93ba

由 Herbert Xu 提交于 11月 22, 2013

This patch simplifies the checksum verification in tcpX_gro_receive
by reusing the CHECKSUM_COMPLETE code for CHECKSUM_NONE.  All it
does for CHECKSUM_NONE is compute the partial checksum and then
treat it as if it came from the hardware (CHECKSUM_COMPLETE).
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

Cheers,
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8ee93ba

gro: Only verify TCP checksums for candidates · cc5c00bb

由 Herbert Xu 提交于 11月 22, 2013

In some cases we may receive IP packets that are longer than
their stated lengths.  Such packets are never merged in GRO.
However, we may end up computing their checksums incorrectly
and end up allowing packets with a bogus checksum enter our
stack with the checksum status set as verified.

Since such packets are rare and not performance-critical, this
patch simply skips the checksum verification for them.
Reported-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>

Thanks,
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc5c00bb

21 11月, 2013 1 次提交

ipv4: fix race in concurrent ip_route_input_slow() · dcdfdf56

由 Alexei Starovoitov 提交于 11月 19, 2013

CPUs can ask for local route via ip_route_input_noref() concurrently.
if nh_rth_input is not cached yet, CPUs will proceed to allocate
equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
via rt_cache_route()
Most of the time they succeed, but on occasion the following two lines:
	orig = *p;
	prev = cmpxchg(p, orig, rt);
in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
so dst is leaking. dst_destroy() is never called and 'lo' device
refcnt doesn't go to zero, which can be seen in the logs as:
	unregister_netdevice: waiting for lo to become free. Usage count = 1
Adding mdelay() between above two lines makes it easily reproducible.
Fix it similar to nh_pcpu_rth_output case.

Fixes: d2d68ba9 ("ipv4: Cache input routes in fib_info nexthops.")
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dcdfdf56

20 11月, 2013 3 次提交

genetlink: only pass array to genl_register_family_with_ops() · c53ed742

由 Johannes Berg 提交于 11月 19, 2013

As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.

The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c53ed742

tcp: don't update snd_nxt, when a socket is switched from repair mode · dbde4979

由 Andrey Vagin 提交于 11月 19, 2013

snd_nxt must be updated synchronously with sk_send_head.  Otherwise
tp->packets_out may be updated incorrectly, what may bring a kernel panic.

Here is a kernel panic from my host.
[  103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[  103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
...
[  146.301158] Call Trace:
[  146.301158]  [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0

Before this panic a tcp socket was restored. This socket had sent and
unsent data in the write queue. Sent data was restored in repair mode,
then the socket was switched from reapair mode and unsent data was
restored. After that the socket was switched back into repair mode.

In that moment we had a socket where write queue looks like this:
snd_una    snd_nxt   write_seq
   |_________|________|
             |
	  sk_send_head

After a second switching from repair mode the state of socket was
changed:

snd_una          snd_nxt, write_seq
   |_________ ________|
             |
	  sk_send_head

This state is inconsistent, because snd_nxt and sk_send_head are not
synchronized.

Bellow you can find a call trace, how packets_out can be incremented
twice for one skb, if snd_nxt and sk_send_head are not synchronized.
In this case packets_out will be always positive, even when
sk_write_queue is empty.

tcp_write_wakeup
	skb = tcp_send_head(sk);
	tcp_fragment
		if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
			tcp_adjust_pcount(sk, skb, diff);
	tcp_event_new_data_sent
		tp->packets_out += tcp_skb_pcount(skb);

I think update of snd_nxt isn't required, when a socket is switched from
repair mode.  Because it's initialized in tcp_connect_init. Then when a
write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent,
so it's always is in consistent state.

I have checked, that the bug is not reproduced with this patch and
all tests about restoring tcp connections work fine.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: NAndrey Vagin <avagin@openvz.org>
Acked-by: NPavel Emelyanov <xemul@parallels.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dbde4979

xfrm: Release dst if this dst is improper for vti tunnel · 236c9f84

由 fan.du 提交于 11月 19, 2013

After searching rt by the vti tunnel dst/src parameter,
if this rt has neither attached to any transformation
nor the transformation is not tunnel oriented, this rt
should be released back to ip layer.

otherwise causing dst memory leakage.
Signed-off-by: NFan Du <fan.du@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

236c9f84

19 11月, 2013 2 次提交

ping: prevent NULL pointer dereference on write to msg_name · cf970c00

由 Hannes Frederic Sowa 提交于 11月 18, 2013

A plain read() on a socket does set msg->msg_name to NULL. So check for
NULL pointer first.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf970c00

inet: prevent leakage of uninitialized memory to user in recv syscalls · bceaa902

由 Hannes Frederic Sowa 提交于 11月 18, 2013

Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.

If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.
Reported-by: Nmpb <mpb.mail@gmail.com>
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bceaa902

18 11月, 2013 1 次提交

netfilter: synproxy: send mss option to backend · a6441b7a

由 Martin Topholm 提交于 11月 14, 2013

When the synproxy_parse_options is called on the client ack the mss
option will not be present. Consequently mss wont be included in the
backend syn packet, which falls back to 536 bytes mss.

Therefore XT_SYNPROXY_OPT_MSS is explicitly flagged when recovering mss
value from cookie.
Signed-off-by: NMartin Topholm <mph@one.com>
Reviewed-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

a6441b7a

15 11月, 2013 5 次提交

seq_file: remove "%n" usage from seq_file users · 652586df

由 Tetsuo Handa 提交于 11月 14, 2013

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NKees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

652586df

ipv4: fix possible seqlock deadlock · c9e90429

由 Eric Dumazet 提交于 11月 14, 2013

ip4_datagram_connect() being called from process context,
it should use IP_INC_STATS() instead of IP_INC_STATS_BH()
otherwise we can deadlock on 32bit arches, or get corruptions of
SNMP counters.

Fixes: 584bdf8c ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDave Jones <davej@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c9e90429

genetlink: make all genl_ops users const · 4534de83

由 Johannes Berg 提交于 11月 14, 2013

Now that genl_ops are no longer modified in place when
registering, they can be made const. This patch was done
mostly with spatch:

@@
identifier ops;
@@
+const
 struct genl_ops ops[] = {
 ...
 };

(except the struct thing in net/openvswitch/datapath.c)
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4534de83

net-tcp: fix panic in tcp_fastopen_cache_set() · dccf76ca

由 Eric Dumazet 提交于 11月 13, 2013

We had some reports of crashes using TCP fastopen, and Dave Jones
gave a nice stack trace pointing to the error.

Issue is that tcp_get_metrics() should not be called with a NULL dst

Fixes: 1fe4c481 ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDave Jones <davej@redhat.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Tested-by: NDave Jones <davej@fedoraproject.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dccf76ca

tcp: tsq: restore minimal amount of queueing · 98e09386

由 Eric Dumazet 提交于 11月 13, 2013

After commit c9eeec26 ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.

802.11 AMPDU requires a fair amount of queueing to be effective.

This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.

It also remove the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.

Users with well behaving NICS and correct qdisc (like sch_fq),
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.

This new usage of sysctl_tcp_limit_output_bytes permits each driver
authors to check how their driver performs when/if the value is set
to a minimum of 4KB.

Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.

Fixes: c9eeec26 ("tcp: TSQ can use a dynamic limit")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NSujith Manoharan <sujith@msujith.org>
Reported-by: NArnaud Ebalard <arno@natisbad.org>
Tested-by: NSujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

98e09386

14 11月, 2013 1 次提交

core/dev: do not ignore dmac in dev_forward_skb() · 81b9eab5

由 Alexei Starovoitov 提交于 11月 12, 2013

commit 06a23fe3
("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
and refactoring 64261f23
("dev: move skb_scrub_packet() after eth_type_trans()")

are forcing pkt_type to be PACKET_HOST when skb traverses veth.

which means that ip forwarding will kick in inside netns
even if skb->eth->h_dest != dev->dev_addr

Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb()
and in ip_tunnel_rcv()

Fixes: 06a23fe3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
CC: Isaku Yamahata <yamahatanetdev@gmail.com>
CC: Maciej Zenczykowski <zenczykowski@gmail.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

81b9eab5

08 11月, 2013 1 次提交

inet: fix a UFO regression · dcd60771

由 Eric Dumazet 提交于 11月 07, 2013

While testing virtio_net and skb_segment() changes, Hannes reported
that UFO was sending wrong frames.

It appears this was introduced by a recent commit :
8c3a897b ("inet: restore gso for vxlan")

The old condition to perform IP frag was :

tunnel = !!skb->encapsulation;
...
        if (!tunnel && proto == IPPROTO_UDP) {

So the new one should be :

udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
...
        if (udpfrag) {

Initialization of udpfrag must be done before call
to ops->callbacks.gso_segment(skb, features), as
skb_udp_tunnel_segment() clears skb->encapsulation

(We want udpfrag to be true for UFO, false for VXLAN)

With help from Alexei Starovoitov
Reported-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dcd60771

06 11月, 2013 2 次提交

net: Explicitly initialize u64_stats_sync structures for lockdep · 827da44c

由 John Stultz 提交于 10月 07, 2013

In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.

The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.

This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.

Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.

Feedback would be appreciated!
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Acked-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

827da44c

ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609

由 Hannes Frederic Sowa 提交于 11月 05, 2013

Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers are
never generating outgoing fragments. They may well be fragmented on the
path, but the server never stores or usees path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the IMCP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: NFlorian Weimer <fweimer@redhat.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

482fc609

05 11月, 2013 2 次提交

tcp: properly handle stretch acks in slow start · 9f9843a7

由 Yuchung Cheng 提交于 10月 31, 2013

Slow start now increases cwnd by 1 if an ACK acknowledges some packets,
regardless the number of packets. Consequently slow start performance
is highly dependent on the degree of the stretch ACKs caused by
receiver or network ACK compression mechanisms (e.g., delayed-ACK,
GRO, etc).  But slow start algorithm is to send twice the amount of
packets of packets left so it should process a stretch ACK of degree
N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A
follow up patch will use the remainder of the N (if greater than 1)
to adjust cwnd in the congestion avoidance phase.

In addition this patch retires the experimental limited slow start
(LSS) feature. LSS has multiple drawbacks but questionable benefit. The
fractional cwnd increase in LSS requires a loop in slow start even
though it's rarely used. Configuring such an increase step via a global
sysctl on different BDPS seems hard. Finally and most importantly the
slow start overshoot concern is now better covered by the Hybrid slow
start (hystart) enabled by default.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9f9843a7

tcp: enable sockets to use MSG_FASTOPEN by default · 0d41cca4

由 Yuchung Cheng 提交于 10月 31, 2013

Applications have started to use Fast Open (e.g., Chrome browser has
such an optional flag) and the feature has gone through several
generations of kernels since 3.7 with many real network tests. It's
time to enable this flag by default for applications to test more
conveniently and extensively.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0d41cca4