提交 · 168fc21a971e4bc821a7838ccc39b7bdaf316c11 · openeuler / raspberrypi-kernel

20 5月, 2013 1 次提交

tcp: remove bad timeout logic in fast recovery · 3e59cb0d

由 Yuchung Cheng 提交于 5月 17, 2013

tcp_timeout_skb() was intended to trigger fast recovery on timeout,
unfortunately in reality it often causes spurious retransmission
storms during fast recovery. The particular sign is a fast retransmit
over the highest sacked sequence (SND.FACK).

Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
to avoid spurious timeout: when SND.UNA advances the sender re-arms
RTO and extends the timeout by icsk_rto. The sender does not offset
the time elapsed since the packet at SND.UNA was sent.

But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
sent before the icsk_rto interval lost, including one that's above the
highest sacked sequence. Most likely a large part of scorebard will be
marked.

If most packets are not lost then the subsequent DUPACKs with new SACK
blocks will cause the sender to continue to retransmit packets beyond
SND.FACK spuriously. Even if only one packet is lost the sender may
falsely retransmit almost the entire window.

The situation becomes common in the world of bufferbloat: the RTT
continues to grow as the queue builds up but RTTVAR remains small and
close to the minimum 200ms. If a data packet is lost and the DUPACK
triggered by the next data packet is slightly delayed, then a spurious
retransmission storm forms.

As the original comment on tcp_timeout_skb() suggests: the usefulness
of this feature is questionable. It also wastes cycles walking the
sack scoreboard and is actually harmful because of false recovery.

It's time to remove this.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NNandita Dukkipati <nanditad@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e59cb0d

17 5月, 2013 1 次提交

tcp: speedup tcp_fixup_rcvbuf() · d2cf4367

由 Eric Dumazet 提交于 5月 15, 2013

tcp_fixup_rcvbuf() contains a loop to estimate initial socket
rcv space needed for a given mss. With large MTU (like 64K on lo),
we can loop ~500 times and consume a lot of cpu cycles.

perf top of 200 concurrent netperf -t TCP_CRR

5.62%  netperf  [kernel.kallsyms]  [k] tcp_init_buffer_space
1.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock
1.55%  netperf  [kernel.kallsyms]  [k] kmem_cache_free
1.51%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb
1.50%  netperf  [kernel.kallsyms]  [k] tcp_ack

Lets use a 100% factor, and remove the loop.

100% is needed anyway for tcp_adv_win_scale=1
default value, and is also the maximum factor.

Refs: commit b49960a0
      ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2cf4367

12 5月, 2013 1 次提交

ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions · 2fbd9679

由 Denis Efremov 提交于 5月 08, 2013

EXPORT_SYMBOL and inline directives are contradictory to each other.
The patch fixes this inconsistency.

Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: NDenis Efremov <yefremov.denis@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2fbd9679

09 5月, 2013 1 次提交

gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol() · 19acc327

由 Pravin B Shelar 提交于 5月 07, 2013

Rather than having logic to calculate inner protocol in every
tunnel gso handler move it to gso code. This simplifies code.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19acc327

06 5月, 2013 3 次提交

fib_trie: no need to delay vfree() · 00203563

由 Al Viro 提交于 5月 05, 2013

Now that vfree() can be called from interrupt contexts, there's no
need to play games with schedule_work() to escape calling vfree()
from RCU callbacks.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00203563

net: frag, fix race conditions in LRU list maintenance · b56141ab

由 Konstantin Khlebnikov 提交于 5月 05, 2013

This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add()
which was introduced in commit 3ef0eb0d
("net: frag, move LRU list maintenance outside of rwlock")

One cpu already added new fragment queue into hash but not into LRU.
Other cpu found it in hash and tries to move it to the end of LRU.
This leads to NULL pointer dereference inside of list_move_tail().

Another possible race condition is between inet_frag_lru_move() and
inet_frag_lru_del(): move can happens after deletion.

This patch initializes LRU list head before adding fragment into hash and
inet_frag_lru_move() doesn't touches it if it's empty.

I saw this kernel oops two times in a couple of days.

[119482.128853] BUG: unable to handle kernel NULL pointer dereference at           (null)
[119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
[119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
[119482.140221] Oops: 0000 [#1] SMP
[119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
[119482.152692] CPU 3
[119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
[119482.161478] RIP: 0010:[<ffffffff812ede89>]  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
[119482.166004] RSP: 0018:ffff880216d5db58  EFLAGS: 00010207
[119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
[119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
[119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
[119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
[119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
[119482.194140] FS:  00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
[119482.198928] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
[119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
[119482.223113] Stack:
[119482.228004]  ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
[119482.233038]  ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
[119482.238083]  00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
[119482.243090] Call Trace:
[119482.248009]  [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10
[119482.252921]  [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0
[119482.257803]  [<ffffffff8154485b>] nf_iterate+0x8b/0xa0
[119482.262658]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
[119482.267527]  [<ffffffff815448e4>] nf_hook_slow+0x74/0x130
[119482.272412]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
[119482.277302]  [<ffffffff8155d068>] ip_rcv+0x268/0x320
[119482.282147]  [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0
[119482.286998]  [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60
[119482.291826]  [<ffffffff8151a650>] process_backlog+0xa0/0x160
[119482.296648]  [<ffffffff81519f29>] net_rx_action+0x139/0x220
[119482.301403]  [<ffffffff81053707>] __do_softirq+0xe7/0x220
[119482.306103]  [<ffffffff81053868>] run_ksoftirqd+0x28/0x40
[119482.310809]  [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0
[119482.315515]  [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40
[119482.320219]  [<ffffffff8106d870>] kthread+0xc0/0xd0
[119482.324858]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
[119482.329460]  [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0
[119482.334057]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
[119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
[119482.343787] RIP  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
[119482.348675]  RSP <ffff880216d5db58>
[119482.353493] CR2: 0000000000000000

Oops happened on this path:
ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b56141ab

tcp: do not expire TCP fastopen cookies · efeaa555

由 Eric Dumazet 提交于 5月 03, 2013

TCP metric cache expires entries after one hour.

This probably make sense for TCP RTT/RTTVAR/CWND, but not
for TCP fastopen cookies.

Its better to try previous cookie. If it appears to be obsolete,
server will send us new cookie anyway.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

efeaa555

04 5月, 2013 2 次提交

vxlan: Fix TCPv6 segmentation. · 0d05535d

由 Pravin B Shelar 提交于 5月 02, 2013

This patch set correct skb->protocol so that inner packet can
lookup correct gso handler.
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0d05535d

gre: Fix GREv4 TCPv6 segmentation. · 9b3eb5ed

由 Pravin B Shelar 提交于 5月 02, 2013

For ipv6 traffic, GRE can generate packet with strange GSO
bits, e.g. ipv4 packet with SKB_GSO_TCPV6 flag set.  Therefore
following patch relaxes check in inet gso handler to allow
such packet for segmentation.
This patch also fixes wrong skb->protocol set that was done in
gre_gso_segment() handler.
Reported-by: NSteinar H. Gunderson <sesse@google.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b3eb5ed

02 5月, 2013 1 次提交

proc: Supply a function to remove a proc entry by PDE · a8ca16ea

由 David Howells 提交于 4月 12, 2013

Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NGrant Likely <grant.likely@linaro.or>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a8ca16ea

30 4月, 2013 3 次提交

tcp: reset timer after any SYNACK retransmit · cd75eff6

由 Yuchung Cheng 提交于 4月 29, 2013

Linux immediately returns SYNACK on (spurious) SYN retransmits, but
keeps the SYNACK timer running independently. Thus the timer may
fire right after the SYNACK retransmit and causes a SYN-SYNACK
cross-fire burst.

Adopt the fast retransmit/recovery idea in established state by
re-arming the SYNACK timer after the fast (SYNACK) retransmit. The
timer may fire late up to 500ms due to the current SYNACK timer wheel,
but it's OK to be conservative when network is congested. Eric's new
listener design should address this issue.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd75eff6

net: Add MIB counters for checksum errors · 6a5dc9e5

由 Eric Dumazet 提交于 4月 29, 2013

Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep  Csum
IcmpInCsumErrors                72                 0.0
TcpInCsumErrors                 382                0.0
UdpInCsumErrors                 463221             0.0
Icmp6InCsumErrors               75                 0.0
Udp6InCsumErrors                173442             0.0
IpExtInCsumErrors               10884              0.0
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6a5dc9e5

net: defer net_secret[] initialization · aebda156

由 Eric Dumazet 提交于 4月 29, 2013

Instead of feeding net_secret[] at boot time, defer the init
at the point first socket is created.

This permits some platforms to use better entropy sources than
the ones available at boot time.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aebda156

25 4月, 2013 1 次提交

net: ipv4: typo issue, remove erroneous semicolon · 2bac7cb3

由 Chen Gang 提交于 4月 22, 2013

Need remove erroneous semicolon, which is found by EXTRA_CFLAGS=-W,
the related commit number: c5441932
("GRE: Refactor GRE tunneling code")
Signed-off-by: NChen Gang <gang.chen@asianux.com>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2bac7cb3

20 4月, 2013 2 次提交

netlink: rename ssk to sk in struct netlink_skb_params · e32123e5

由 Patrick McHardy 提交于 4月 17, 2013

Memory mapped netlink needs to store the receiving userspace socket
when sending from the kernel to userspace. Rename 'ssk' to 'sk' to
avoid confusion.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e32123e5

tcp: call tcp_replace_ts_recent() from tcp_ack() · 12fb3dd9

由 Eric Dumazet 提交于 4月 19, 2013

commit bd090dfc (tcp: tcp_replace_ts_recent() should not be called
from tcp_validate_incoming()) introduced a TS ecr bug in slow path
processing.

1 A > B P. 1:10001(10000) ack 1 <nop,nop,TS val 1001 ecr 200>
2 B < A . 1:1(0) ack 1 win 257 <sack 9001:10001,TS val 300 ecr 1001>
3 A > B . 1:1001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
4 A > B . 1001:2001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>

(ecr 200 should be ecr 300 in packets 3 & 4)

Problem is tcp_ack() can trigger send of new packets (retransmits),
reflecting the prior TSval, instead of the TSval contained in the
currently processed incoming packet.

Fix this by calling tcp_replace_ts_recent() from tcp_ack() after the
checks, but before the actions.
Reported-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

12fb3dd9

19 4月, 2013 4 次提交

netfilter: xt_rpfilter: depend on raw or mangle table · d37d6968

由 Florian Westphal 提交于 4月 17, 2013

rpfilter is only valid in raw/mangle PREROUTING, i.e.
RPFILTER=y|m is useless without raw or mangle table support.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

d37d6968

netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too · f83a7ea2

由 Florian Westphal 提交于 4月 17, 2013

Alex Efros reported rpfilter module doesn't match following packets:
IN=br.qemu SRC=192.168.2.1 DST=192.168.2.255 [ .. ]
(netfilter bugzilla #814).

Problem is that network stack arranges for the locally generated broadcasts
to appear on the interface they were sent out, so the IFF_LOOPBACK check
doesn't trigger.

As -m rpfilter is restricted to PREROUTING, we can check for existing
rtable instead, it catches locally-generated broad/multicast case, too.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f83a7ea2

tcp: introduce TCPSpuriousRtxHostQueues SNMP counter · 0e280af0

由 Eric Dumazet 提交于 4月 18, 2013

Host queues (Qdisc + NIC) can hold packets so long that TCP can
eventually retransmit a packet before the first transmit even left
the host.

Its not clear right now if we could avoid this in the first place :

- We could arm RTO timer not at the time we enqueue packets, but
  at the time we TX complete them (tcp_wfree())

- Cancel the sending of the new copy of the packet if prior one
  is still in queue.

This patch adds instrumentation so that we can at least see how
often this problem happens.

TCPSpuriousRtxHostQueues SNMP counter is incremented every time
we detect the fast clone is not yet freed in tcp_transmit_skb()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e280af0

netfilter: add my copyright statements · f229f6ce

由 Patrick McHardy 提交于 4月 06, 2013

Add copyright statements to all netfilter files which have had significant
changes done by myself in the past.

Some notes:

- nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
  Core Team when it got split out of nf_conntrack_core.c. The copyrights
  even state a date which lies six years before it was written. It was
  written in 2005 by Harald and myself.

- net/ipv{4,6}/netfilter.c, net/netfitler/nf_queue.c were missing copyright
  statements. I've added the copyright statement from net/netfilter/core.c,
  where this code originated

- for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
  it to give the wrong impression
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f229f6ce

17 4月, 2013 1 次提交

net: drop dst before queueing fragments · 97599dc7

由 Eric Dumazet 提交于 4月 16, 2013

Commit 4a94445c (net: Use ip_route_input_noref() in input path)
added a bug in IP defragmentation handling, as non refcounted
dst could escape an RCU protected section.

Commit 64f3b9e2 (net: ip_expire() must revalidate route) fixed
the case of timeouts, but not the general problem.

Tom Parkin noticed crashes in UDP stack and provided a patch,
but further analysis permitted us to pinpoint the root cause.

Before queueing a packet into a frag list, we must drop its dst,
as this dst has limited lifetime (RCU protected)

When/if a packet is finally reassembled, we use the dst of the very
last skb, still protected by RCU and valid, as the dst of the
reassembled packet.

Use same logic in IPv6, as there is no need to hold dst references.
Reported-by: NTom Parkin <tparkin@katalix.com>
Tested-by: NTom Parkin <tparkin@katalix.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97599dc7

16 4月, 2013 1 次提交

esp4: fix error return code in esp_output() · 06848c10

由 Wei Yongjun 提交于 4月 13, 2013

Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06848c10

15 4月, 2013 2 次提交

net: tcp_memcontrol: minor: remove unused variable · 4731d011

由 Daniel Borkmann 提交于 4月 14, 2013

Commit 10b96f73 (``tcp_memcontrol: remove a redundant statement
in tcp_destroy_cgroup()'') says ``We read the value but make no use
of it.'', but forgot to remove the variable declaration as well. This
was a follow-up commit of 3f134619 (``memcg: decrement static keys
at real destroy time'') that removed the read of variable 'val'.

This fixes therefore:

  CC      net/ipv4/tcp_memcontrol.o
net/ipv4/tcp_memcontrol.c: In function ‘tcp_destroy_cgroup’:
net/ipv4/tcp_memcontrol.c:67:6: warning: unused variable ‘val’ [-Wunused-variable]
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4731d011

net: sock: make sock_tx_timestamp void · bf84a010

由 Daniel Borkmann 提交于 4月 14, 2013

Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d49473, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf84a010

14 4月, 2013 1 次提交

tcp: tcp_tso_segment() small optimization · bece1b97

由 Eric Dumazet 提交于 4月 13, 2013

We can move th->check computation out of the loop, as compiler
doesn't know each skb initially share same tcp headers after
skb_segment()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bece1b97

13 4月, 2013 1 次提交

tcp: GSO should be TSQ friendly · d6a4a104

由 Eric Dumazet 提交于 4月 12, 2013

I noticed that TSQ (TCP Small queues) was less effective when TSO is
turned off, and GSO is on. If BQL is not enabled, TSQ has then no
effect.

It turns out the GSO engine frees the original gso_skb at the time the
fragments are generated and queued to the NIC.

We should instead call the tcp_wfree() destructor for the last fragment,
to keep the flow control as intended in TSQ. This effectively limits
the number of queued packets on qdisc + NIC layers.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6a4a104

12 4月, 2013 2 次提交

tcp: Reallocate headroom if it would overflow csum_start · 50bceae9

由 Thomas Graf 提交于 4月 11, 2013

If a TCP retransmission gets partially ACKed and collapsed multiple
times it is possible for the headroom to grow beyond 64K which will
overflow the 16bit skb->csum_start which is based on the start of
the headroom. It has been observed rarely in the wild with IPoIB due
to the 64K MTU.

Verify if the acking and collapsing resulted in a headroom exceeding
what csum_start can cover and reallocate the headroom if so.

A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at
LLNL for helping out with the investigation and testing.
Reported-by: NJim Foraker <foraker1@llnl.gov>
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

50bceae9

tcp: incoming connections might use wrong route under synflood · d66954a0

由 Dmitry Popov 提交于 4月 11, 2013

There is a bug in cookie_v4_check (net/ipv4/syncookies.c):
	flowi4_init_output(&fl4, 0, sk->sk_mark, RT_CONN_FLAGS(sk),
			   RT_SCOPE_UNIVERSE, IPPROTO_TCP,
			   inet_sk_flowi_flags(sk),
			   (opt && opt->srr) ? opt->faddr : ireq->rmt_addr,
			   ireq->loc_addr, th->source, th->dest);

Here we do not respect sk->sk_bound_dev_if, therefore wrong dst_entry may be
taken. This dst_entry is used by new socket (get_cookie_sock ->
tcp_v4_syn_recv_sock), so its packets may take the wrong path.
Signed-off-by: NDmitry Popov <dp@highloadlab.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d66954a0

10 4月, 2013 3 次提交

procfs: new helper - PDE_DATA(inode) · d9dda78b

由 Al Viro 提交于 3月 31, 2013

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d9dda78b

selinux: add a skb_owned_by() hook · ca10b9e9

由 Eric Dumazet 提交于 4月 08, 2013

Commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb())
broke certain SELinux/NetLabel configurations by no longer correctly
assigning the sock to the outgoing SYNACK packet.

Cost of atomic operations on the LISTEN socket is quite big,
and we would like it to happen only if really needed.

This patch introduces a new security_ops->skb_owned_by() method,
that is a void operation unless selinux is active.
Reported-by: NMiroslav Vadkerti <mvadkert@redhat.com>
Diagnosed-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-security-module@vger.kernel.org
Acked-by: NJames Morris <james.l.morris@oracle.com>
Tested-by: NPaul Moore <pmoore@redhat.com>
Acked-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca10b9e9

tcp_memcontrol: remove a redundant statement in tcp_destroy_cgroup() · 10b96f73

由 Zefan Li 提交于 4月 08, 2013

We read the value but make no use of it.
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10b96f73

09 4月, 2013 3 次提交

net: ipv4: fix schedule while atomic bug in check_lifetime() · c988d1e8

由 Jiri Pirko 提交于 4月 04, 2013

move might_sleep operations out of the rcu_read_lock() section.
Also fix iterating over ifa_dev->ifa_list

Introduced by: commit 5c766d64 "ipv4: introduce address lifetime"
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c988d1e8

net: ipv4: reset check_lifetime_work after changing lifetime · 05a324b9

由 Jiri Pirko 提交于 4月 04, 2013

This will result in calling check_lifetime in nearest opportunity and
that function will adjust next time to call check_lifetime correctly.
Without this, check_lifetime is called in time computed by previous run,
not affecting modified lifetime.

Introduced by: commit 5c766d64 "ipv4: introduce address lifetime"
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

05a324b9

ip_gre: fix a possible crash in parse_gre_header() · 22251c73

由 Eric Dumazet 提交于 4月 04, 2013

pskb_may_pull() can change skb->head, so we must init iph/greh after
calling it.

Bug added in commit c5441932 (GRE: Refactor GRE tunneling code.)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

22251c73

08 4月, 2013 2 次提交

netfilter: nat: propagate errors from xfrm_me_harder() · aaa795ad

由 Patrick McHardy 提交于 4月 05, 2013

Propagate errors from ip_xfrm_me_harder() instead of returning EPERM in
all cases.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

aaa795ad

netfilter: ipv4: propagate routing errors from ip_route_me_harder() · c9e1673a

由 Patrick McHardy 提交于 4月 05, 2013

Propagate routing errors from ip_route_me_harder() when dropping a packet
using NF_DROP_ERR(). This makes userspace get the proper error instead of
EPERM for everything.

Example:

# ip r a unreachable default table 100
# ip ru add fwmark 0x1 lookup 100
# iptables -t mangle -A OUTPUT -d 8.8.8.8 -j MARK --set-mark 0x1

Current behaviour:

PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted

New behaviour:

PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c9e1673a

06 4月, 2013 2 次提交

netfilter: ipt_ULOG: add net namespace support for ipt_ULOG · 35543067

由 Gao feng 提交于 3月 24, 2013

Add pernet support to ipt_ULOG by means of the new nf_log_set
function added in (30e0c6a6 netfilter: nf_log: prepare net
namespace support for loggers).

This patch also make ulog_buffers and netlink socket
nflognl per netns.
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

35543067

netfilter: nf_log: prepare net namespace support for loggers · 30e0c6a6

由 Gao feng 提交于 3月 24, 2013

This patch adds netns support to nf_log and it prepares netns
support for existing loggers. It is composed of four major
changes.

1) nf_log_register has been split to two functions: nf_log_register
   and nf_log_set. The new nf_log_register is used to globally
   register the nf_logger and nf_log_set is used for enabling
   pernet support from nf_loggers.

   Per netns is not yet complete after this patch, it comes in
   separate follow up patches.

2) Add net as a parameter of nf_log_bind_pf. Per netns is not
   yet complete after this patch, it only allows to bind the
   nf_logger to the protocol family from init_net and it skips
   other cases.

3) Adapt all nf_log_packet callers to pass netns as parameter.
   After this patch, this function only works for init_net.

4) Make the sysctl net/netfilter/nf_log pernet.
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

30e0c6a6

05 4月, 2013 2 次提交

net: ipv4: notify when address lifetime changes · 34e2ed34

由 Jiri Pirko 提交于 4月 04, 2013

if userspace changes lifetime of address, send netlink notification and
call notifier.
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34e2ed34

net: frag queue per hash bucket locking · 19952cc4

由 Jesper Dangaard Brouer 提交于 4月 03, 2013

This patch implements per hash bucket locking for the frag queue
hash.  This removes two write locks, and the only remaining write
lock is for protecting hash rebuild.  This essentially reduce the
readers-writer lock to a rebuild lock.

This patch is part of "net: frag performance followup"
 http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:

Same test setup as previous:
 (http://thread.gmane.org/gmane.linux.network/257155)
 Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
 Ethernet flow-control.  A third interface is used for generating the
 DoS attack (with trafgen).

Notice, I have changed the frag DoS generator script to be more
efficient/deadly.  Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.

Test types summary (netperf UDP_STREAM):
 Test-20G64K     == 2x10G with 65K fragments
 Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
 Test-20G64K+DoS == Same as 20G64K with frag DoS
 Test-20G3F+DoS  == Same as 20G3F  with frag DoS
 Test-20G64K+MQ  == Same as 20G64K with Multi-Queue frag DoS
 Test-20G3F+MQ   == Same as 20G3F  with Multi-Queue frag DoS

When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression.  BUT this
was caused by some unrelated change in-between.  See tests below.

Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch

Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.

Performance table summary (in Mbit/s):

(#) Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
    ----------  -------   -------  ----------  ---------  --------  -------
(A) Patch-02  : 18848.7   13230.1   4103.04     5310.36     130.0    440.2
(B) 1b5ab0de  : 18841.5   13156.8   4101.08     5314.57     129.0    424.2
(C) Patch-03v1: 18838.0   13490.5   4405.11     6814.72     196.6    461.6

(D) a210576c  : 18321.5   11250.4   3635.34     5160.13     119.1    405.2
(E) with _bh  : 17247.3   11492.6   3994.74     6405.29     166.7    413.6
(F) without bh: 17471.3   11298.7   3818.05     6102.11     165.7    406.3

Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.

I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried).  But the other results does show
improvements.  And test (E) "with _bh" version is slightly better.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>

----
V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
  need the spinlock _bh versions, as Netfilter currently does a
  local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail
V3:
- Drop the chain_len counter per hash bucket.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19952cc4