提交 · a34cd4f31919119d8ab2d42330fb8364aa430551 · openanolis / cloud-kernel

25 2月, 2014 9 次提交

vti4: Use the on xfrm_lookup returned dst_entry directly · a34cd4f3

由 Steffen Klassert 提交于 2月 21, 2014

We need to be protocol family indepenent to support
inter addresss family tunneling with vti. So use a
dst_entry instead of the ipv4 rtable in vti_tunnel_xmit.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

a34cd4f3

xfrm4: Remove xfrm_tunnel_notifier · 9994bb8e

由 Steffen Klassert 提交于 2月 21, 2014

This was used from vti and is replaced by the IPsec protocol
multiplexer hooks. It is now unused, so remove it.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

9994bb8e

vti: Update the ipv4 side to use it's own receive hook. · df3893c1

由 Steffen Klassert 提交于 2月 21, 2014

With this patch, vti uses the IPsec protocol multiplexer to
register it's own receive side hooks for ESP, AH and IPCOMP.

Vti now does the following on receive side:

1. Do an input policy check for the IPsec packet we received.
   This is required because this packet could be already
   prosecces by IPsec, so an inbuond policy check is needed.

2. Mark the packet with the i_key. The policy and the state
   must match this key now. Policy and state belong to the outer
   namespace and policy enforcement is done at the further layers.

3. Call the generic xfrm layer to do decryption and decapsulation.

4. Wait for a callback from the xfrm layer to properly clean the
   skb to not leak informations on namespace and to update the
   device statistics.

On transmit side:

1. Mark the packet with the o_key. The policy and the state
   must match this key now.

2. Do a xfrm_lookup on the original packet with the mark applied.

3. Check if we got an IPsec route.

4. Clean the skb to not leak informations on namespace
   transitions.

5. Attach the dst_enty we got from the xfrm_lookup to the skb.

6. Call dst_output to do the IPsec processing.

7. Do the device statistics.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

df3893c1

ip_tunnel: Make vti work with i_key set · 6d608f06

由 Steffen Klassert 提交于 2月 21, 2014

Vti uses the o_key to mark packets that were transmitted or received
by a vti interface. Unfortunately we can't apply different marks
to in and outbound packets with only one key availabe. Vti interfaces
typically use wildcard selectors for vti IPsec policies. On forwarding,
the same output policy will match for both directions. This generates
a loop between the IPsec gateways until the ttl of the packet is
exceeded.

The gre i_key/o_key are usually there to find the right gre tunnel
during a lookup. When vti uses the i_key to mark packets, the tunnel
lookup does not work any more because vti does not use the gre keys
as a hash key for the lookup.

This patch workarounds this my not including the i_key when comupting
the hash for the tunnel lookup in case of vti tunnels.

With this we have separate keys available for the transmitting and
receiving side of the vti interface.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

6d608f06

xfrm: Add xfrm_tunnel_skb_cb to the skb common buffer · 70be6c91

由 Steffen Klassert 提交于 2月 21, 2014

IPsec vti_rcv needs to remind the tunnel pointer to
check it later at the vti_rcv_cb callback. So add
this pointer to the IPsec common buffer, initialize
it and check it to avoid transport state matching of
a tunneled packet.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

70be6c91

ipcomp4: Use the IPsec protocol multiplexer API · d099160e

由 Steffen Klassert 提交于 2月 21, 2014

Switch ipcomp4 to use the new IPsec protocol multiplexer.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

d099160e

ah4: Use the IPsec protocol multiplexer API · e5b56454

由 Steffen Klassert 提交于 2月 21, 2014

Switch ah4 to use the new IPsec protocol multiplexer.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

e5b56454

esp4: Use the IPsec protocol multiplexer API · 827789cb

由 Steffen Klassert 提交于 2月 21, 2014

Switch esp4 to use the new IPsec protocol multiplexer.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

827789cb

xfrm4: Add IPsec protocol multiplexer · 3328715e

由 Steffen Klassert 提交于 2月 21, 2014

This patch add an IPsec protocol multiplexer. With this
it is possible to add alternative protocol handlers as
needed for IPsec virtual tunnel interfaces.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

3328715e

20 2月, 2014 2 次提交

tcp: use zero-window when free_space is low · 86c1a045

由 Florian Westphal 提交于 2月 19, 2014

Currently the kernel tries to announce a zero window when free_space
is below the current receiver mss estimate.

When a sender is transmitting small packets and reader consumes data
slowly (or not at all), receiver might be unable to shrink the receive
win because

a) we cannot withdraw already-commited receive window, and,
b) we have to round the current rwin up to a multiple of the wscale
   factor, else we would shrink the current window.

This causes the receive buffer to fill up until the rmem limit is hit.
When this happens, we start dropping packets.

Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
even if socket is not being read from.

As we cannot avoid the "current_win is rounded up to multiple of mss"
issue [we would violate a) above] at least try to prevent the receive buf
growth towards tcp_rmem[2] limit by attempting to move to zero-window
announcement when free_space becomes less than 1/16 of the current
allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
increase our chances to get a zero-window announcement out in time.

Reproducer:
On server:
$ nc -l -p 12345
<suspend it: CTRL-Z>

Client:
#!/usr/bin/env python
import socket
import time

sock = socket.socket()
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("192.168.4.1", 12345));
while True:
   sock.send('A' * 23)
   time.sleep(0.005)

socket buffer on server-side will grow until tcp_rmem[2] is hit,
at which point the client rexmits data until -EDTIMEOUT:

tcp_data_queue invokes tcp_try_rmem_schedule which will call
tcp_prune_queue which calls tcp_clamp_window().  And that function will
grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].

Thanks to Eric Dumazet for running regression tests.

Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Tested-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

86c1a045

ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg · c8e6ad08

由 Hannes Frederic Sowa 提交于 2月 18, 2014

In case we decide in udp6_sendmsg to send the packet down the ipv4
udp_sendmsg path because the destination is either of family AF_INET or
the destination is an ipv4 mapped ipv6 address, we don't honor the
maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.

We simply can check for this option in ip_cmsg_send because no calls to
ipv6 module functions are needed to do so.
Reported-by: NGert Doering <gert@space.net>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c8e6ad08

18 2月, 2014 1 次提交

ipv4: fix counter in_slow_tot · a6254864

由 Duan Jiong 提交于 2月 17, 2014

since commit 89aef892("ipv4: Delete routing cache."), the counter
in_slow_tot can't work correctly.

The counter in_slow_tot increase by one when fib_lookup() return successfully
in ip_route_input_slow(), but actually the dst struct maybe not be created and
cached, so we can increase in_slow_tot after the dst struct is created.
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a6254864

17 2月, 2014 2 次提交

ip_tunnel: return more precise errno value when adding tunnel fails · 6dd3c9ec

由 Florian Westphal 提交于 2月 14, 2014

Currently this always returns ENOBUFS, because the return value of
__ip_tunnel_create is discarded.

A more common failure is a duplicate name (EEXIST).  Propagate the real
error code so userspace can display a more meaningful error message.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6dd3c9ec

ipv4: distinguish EHOSTUNREACH from the ENETUNREACH · cd0f0b95

由 Duan Jiong 提交于 2月 14, 2014

since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."),
the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of
err was always set to ENETUNREACH.
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd0f0b95

15 2月, 2014 2 次提交

tcp: add pacing_rate information into tcp_info · 977cb0ec

由 Eric Dumazet 提交于 2月 13, 2014

Add two new fields to struct tcp_info, to report sk_pacing_rate
and sk_max_pacing_rate to monitoring applications, as ss from iproute2.

User exported fields are 64bit, even if kernel is currently using 32bit
fields.

lpaa5:~# ss -i
..
	 skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
rcv_space:29200
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

977cb0ec

net: introduce netdev_alloc_pcpu_stats() for drivers · 1c213bd2

由 WANG Cong 提交于 2月 13, 2014

There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c213bd2

14 2月, 2014 5 次提交

ipv4: ipconfig.c: add parentheses in an if statement · 357137a4

由 FX Le Bail 提交于 2月 10, 2014

Even if the 'time_before' macro expand with parentheses, the look is bad.
Signed-off-by: NFrancois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

357137a4

ipv4: ip_forward: perform skb->pkt_type check at the beginning · d4f2fa6a

由 Denis Kirjanov 提交于 2月 13, 2014

Packets which have L2 address different from ours should be
already filtered before entering into ip_forward().

Perform that check at the beginning to avoid processing such packets.
Signed-off-by: NDenis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4f2fa6a

net: remove unnecessary return's · 2045ceae

由 stephen hemminger 提交于 2月 12, 2014

One of my pet coding style peeves is the practice of
adding extra return; at the end of function.
Kill several instances of this in network code.

I suppose some coccinelle wizardy could do this automatically.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2045ceae

tcp: remove unused min_cwnd member of tcp_congestion_ops · 45f74359

由 Stanislav Fomichev 提交于 2月 12, 2014

Commit 684bad11 "tcp: use PRR to reduce cwin in CWR state" removed all
calls to min_cwnd, so we can safely remove it.
Also, remove tcp_reno_min_cwnd because it was only used for min_cwnd.
Signed-off-by: NStanislav Fomichev <stfomichev@yandex-team.ru>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45f74359

net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f

由 Florian Westphal 提交于 2月 13, 2014

Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.

Given:
Host <mtu1500> R1 <mtu1200> R2

Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.

In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.

When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.

This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.

For ipv6, we send out pkt too big error for gso if the individual
segments are too big.

For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.

Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.

However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there.  I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.

Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.

This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small.  Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway.  Also its not like this could not be improved later
once the dust settles.
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Reported-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fe6cc55f

12 2月, 2014 2 次提交

{IPv4,xfrm} Add ESN support for AH ingress part · d8b2a860

由 Fan Du 提交于 1月 18, 2014

This patch add esn support for AH input stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.

Then the ICV value will guard upper 32bits sequence number as well when
packet getting in.
Signed-off-by: NFan Du <fan.du@windriver.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

d8b2a860

{IPv4,xfrm} Add ESN support for AH egress part · d4d573d0

由 Fan Du 提交于 1月 18, 2014

This patch add esn support for AH output stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.

Then the ICV value will guard upper 32bits sequence number as well when
packet going out.
Signed-off-by: NFan Du <fan.du@windriver.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

d4d573d0

11 2月, 2014 1 次提交

tcp: tsq: fix nonagle handling · bf06200e

由 John Ogness 提交于 2月 09, 2014

Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
regression for applications using TCP_NODELAY.

If TCP session is throttled because of tsq, we should consult
tp->nonagle when TX completion is done and allow us to send additional
segment, especially if this segment is not a full MSS.
Otherwise this segment is sent after an RTO.

[edumazet] : Cooked the changelog, added another fix about testing
sk_wmem_alloc twice because TX completion can happen right before
setting TSQ_THROTTLED bit.

This problem is particularly visible with recent auto corking,
but might also be triggered with low tcp_limit_output_bytes
values or NIC drivers delaying TX completion by hundred of usec,
and very low rtt.

Thomas Glanzmann for example reported an iscsi regression, caused
by tcp auto corking making this bug quite visible.

Fixes: 46d3ceab ("tcp: TCP Small Queues")
Signed-off-by: NJohn Ogness <john.ogness@linutronix.de>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NThomas Glanzmann <thomas@glanzmann.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf06200e

10 2月, 2014 1 次提交

tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min · b10bd54c

由 Jesper Juhl 提交于 2月 09, 2014

As far as I can tell we have used a default of 60 seconds for
FIN_WAIT2 timeout for ages (since 2.x times??).

In any case, the timeout these days is 60 seconds, so the 3 min
comment is wrong (and cost me a few minutes of my life when I was
debugging a FIN_WAIT2 related problem in a userspace application and
checked the kernel source for details).
Signed-off-by: NJesper Juhl <jj@chaosbits.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b10bd54c

07 2月, 2014 2 次提交

tcp: remove 1ms offset in srtt computation · 4a5ab4e2

由 Eric Dumazet 提交于 2月 06, 2014

TCP pacing depends on an accurate srtt estimation.

Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.

We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.

All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.

The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.

This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :

$ ss -i dst cca2
Netid  State      Recv-Q Send-Q          Local Address:Port
Peer Address:Port
tcp    ESTAB      0      0                10.244.129.1:56984
10.244.129.2:12865
	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp    ESTAB      0      390960           10.244.129.1:60247
10.244.129.2:50204
	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
Reported-by: NVytautas Valancius <valas@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4a5ab4e2

ipv4: Fix runtime WARNING in rtmsg_ifa() · 63b5f152

由 Geert Uytterhoeven 提交于 2月 05, 2014

On m68k/ARAnyM:

WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
Modules linked in:
CPU: 0 PID: 407 Comm: ifconfig Not tainted
3.13.0-atari-09263-g0c71d68014d1 #1378
Stack from 10c4fdf0:
        10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
        00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
        c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
        00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
        008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
        eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
 [<00024416>] warn_slowpath_null+0x14/0x1a
 [<002501b4>] rtmsg_ifa+0xdc/0xf0
 [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
 [<0024f9b4>] inet_abc_len+0x0/0x42
 [<00250b52>] inet_insert_ifa+0xc/0x12
 [<00252390>] devinet_ioctl+0x2ae/0x5d6

Adding some debugging code reveals that net_fill_ifaddr() fails in

    put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                              preferred, valid))

nla_put complains:

    lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20

Apparently commit 5c766d64 ("ipv4:
introduce address lifetime") forgot to take into account the addition of
struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already
done for ipv6.
Suggested-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

63b5f152

06 2月, 2014 3 次提交

netfilter: nf_tables: add reject module for NFPROTO_INET · 05513e9e

由 Patrick McHardy 提交于 2月 05, 2014

Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

05513e9e

netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts · cc4723ca

由 Patrick McHardy 提交于 2月 05, 2014

Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

cc4723ca

netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315

由 Alexey Dobriyan 提交于 2月 03, 2014

Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
incorrect handling of EBUSY for RTCP expectation").

BUG: unable to handle kernel paging request at 00100104
IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
...
Call Trace:
  [<c0244bd8>] ? del_timer+0x48/0x70
  [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
  [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
  [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
  [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
  [<c024442d>] call_timer_fn+0x1d/0x80
  [<c024461e>] run_timer_softirq+0x18e/0x1a0
  [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
  [<c023e6f3>] __do_softirq+0xa3/0x170
  [<c023e650>] ? __local_bh_enable+0x70/0x70
  <IRQ>
  [<c023e587>] ? irq_exit+0x67/0xa0
  [<c0202af6>] ? do_IRQ+0x46/0xb0
  [<c027ad05>] ? clockevents_notify+0x35/0x110
  [<c066ac6c>] ? common_interrupt+0x2c/0x40
  [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
  [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
  [<c02085f8>] ? arch_cpu_idle+0x8/0x30
  [<c027314b>] ? cpu_idle_loop+0x4b/0x140
  [<c0273258>] ? cpu_startup_entry+0x18/0x20
  [<c066056d>] ? rest_init+0x5d/0x70
  [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
  [<c081364f>] ? repair_env_string+0x5b/0x5b
  [<c0813269>] ? i386_start_kernel+0x33/0x35
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

829d9315

05 2月, 2014 1 次提交

net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c · a664a4f7

由 Shlomo Pongratz 提交于 2月 02, 2014

RCU writer side should use rcu_dereference_protected() and not
rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
warning seen when running with CONFIG_PROVE_RCU.

Also, don't use rcu_assign_pointer/rcu_dereference for pointers
which are invisible beyond the udp offload code.

Fixes: b582ef09 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: NEric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NShlomo Pongratz <shlomop@mellanox.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a664a4f7

04 2月, 2014 1 次提交

ip_tunnel: fix panic in ip_tunnel_xmit() · b045d37b

由 Eric Dumazet 提交于 2月 03, 2014

Setting rt variable to NULL at the beginning of ip_tunnel_xmit()
missed possible use of this variable as a scratch value.

Also fixes a possible dst leak in tunnel_dst_check() :
If we had to call tunnel_dst_reset(), we forgot to
release the reference on dst.

Merges tunnel_dst_get()/tunnel_dst_check() into
a single tunnel_rtable_get() function for clarity.

Many thanks to Tommi for his report and tests.

Fixes: 7d442fab ("ipv4: Cache dst in tunnels")
Reported-by: NTommi Rantala <tt.rantala@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Tested-by: NTommi Rantala <tt.rantala@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b045d37b

31 1月, 2014 1 次提交

net/ipv4: Use non-atomic allocation of udp offloads structure instance · b5aaab12

由 Or Gerlitz 提交于 1月 29, 2014

Since udp_add_offload() can be called from non-sleepable context e.g
under this call tree from the vxlan driver use case:

  vxlan_socket_create() <-- holds the spinlock
  -> vxlan_notify_add_rx_port()
     -> udp_add_offload()  <-- schedules

we should allocate the udp_offloads structure in atomic manner.

Fixes: b582ef09 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b5aaab12

28 1月, 2014 3 次提交

net: gre: use icmp_hdr() to get inner ip header · c0c0c50f

由 Duan Jiong 提交于 1月 28, 2014

When dealing with icmp messages, the skb->data points the
ip header that triggered the sending of the icmp message.

In gre_cisco_err(), the parse_gre_header() is called, and the
iptunnel_pull_header() is called to pull the skb at the end of
the parse_gre_header(), so the skb->data doesn't point the
inner ip header.

Unfortunately, the ipgre_err still needs those ip addresses in
inner ip header to look up tunnel by ip_tunnel_lookup().

So just use icmp_hdr() to get inner ip header instead of skb->data.
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c0c0c50f

net: Fix memory leak if TPROXY used with TCP early demux · a452ce34

由 Holger Eitzenberger 提交于 1月 27, 2014

I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

unreferenced object 0xffff88008cba4a40 (size 1696):
  comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
  hex dump (first 32 bytes):
    0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
    02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
    [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
    [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
    [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
    [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
    [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
    [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
    [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
    [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
    [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
    [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
    [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
    [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
    [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
    [<ffffffff8127c949>] net_rx_action+0xa7/0x200
    [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157

But there are many more, resulting in the machine going OOM after some
days.

From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():

  void tcp_v4_early_demux(struct sk_buff *skb)
  {
    /* ... */

    iph = ip_hdr(skb);
    th = tcp_hdr(skb);

    if (th->doff < sizeof(struct tcphdr) / 4)
        return;

    sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                       iph->saddr, th->source,
                       iph->daddr, ntohs(th->dest),
                       skb->skb_iif);
    if (sk) {
        skb->sk = sk;

where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it.  This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target.  This then results
in the leak I see.

The very same issue seems to be with IPv6, but haven't tested.
Reviewed-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NHolger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a452ce34

net: ipv4: Use PTR_ERR_OR_ZERO · 27d79f3b

由 Sachin Kamat 提交于 1月 27, 2014

PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
also include missing err.h header.
Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

27d79f3b

25 1月, 2014 1 次提交

fib_frontend: fix possible NULL pointer dereference · a0065f26

由 Oliver Hartkopp 提交于 1月 23, 2014

The two commits 0115e8e3 (net: remove delay at device dismantle) and
748e2d93 (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.

This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a0065f26

24 1月, 2014 3 次提交

remove extra definitions of U32_MAX · 04f9b74e

由 Alex Elder 提交于 1月 23, 2014

Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.
Signed-off-by: NAlex Elder <elder@linaro.org>
Acked-by: NSage Weil <sage@inktank.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

04f9b74e

conditionally define U32_MAX · 77719536

由 Alex Elder 提交于 1月 23, 2014

The symbol U32_MAX is defined in several spots.  Change these
definitions to be conditional.  This is in preparation for the next
patch, which centralizes the definition in <linux/kernel.h>.
Signed-off-by: NAlex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

77719536

ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called · 11c21a30

由 Duan Jiong 提交于 1月 23, 2014

commit a6222602("ip_tunnel: fix kernel panic with icmp_dest_unreach")
clear IPCB in ip_tunnel_xmit()  , or else skb->cb[] may contain garbage from
GSO segmentation layer.

But commit 0e6fbc5b("ip_tunnels: extend iptunnel_xmit()") refactor codes,
and it clear IPCB behind the dst_link_failure().

So clear IPCB in ip_tunnel_xmit() just like commti a6222602("ip_tunnel:
fix kernel panic with icmp_dest_unreach").
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11c21a30

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功