提交 · 5719b296fb81502d0dbbb4e87b3235e5bdcdfc6b · openeuler / raspberrypi-kernel

27 7月, 2015 3 次提交

inet: frag: don't wait for timer deletion when evicting · 5719b296

由 Florian Westphal 提交于 7月 23, 2015

Frank reports 'NMI watchdog: BUG: soft lockup' errors when
load is high.  Instead of (potentially) unbounded restarts of the
eviction process, just skip to the next entry.

One caveat is that, when a netns is exiting, a timer may still be running
by the time inet_evict_bucket returns.

We use the frag memory accounting to wait for outstanding timers,
so that when we free the percpu counter we can be sure no running
timer will trip over it.
Reported-and-tested-by: NFrank Schreuder <fschreuder@transip.nl>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5719b296

inet: frag: change *_frag_mem_limit functions to take netns_frags as argument · 0e60d245

由 Florian Westphal 提交于 7月 23, 2015

Followup patch will call it after inet_frag_queue was freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).
Tested-by: NFrank Schreuder <fschreuder@transip.nl>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e60d245

inet: frag: don't re-use chainlist for evictor · d1fe1944

由 Florian Westphal 提交于 7月 23, 2015

commit 65ba1f1e ("inet: frags: fix a race between inet_evict_bucket
and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

Problem is that ->flags member can be set on other cpu without chainlock
being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED
bit after we put the element on the evictor private list.

We can crash when walking the 'private' evictor list since an element can
be deleted from list underneath the evictor.

Join work with Nikolay Alexandrov.

Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
Reported-by: NJohan Schuijt <johan@transip.nl>
Tested-by: NFrank Schreuder <fschreuder@transip.nl>
Signed-off-by: NNikolay Alexandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d1fe1944

25 7月, 2015 2 次提交

ipv4: consider TOS in fib_select_default · 2392debc

由 Julian Anastasov 提交于 7月 22, 2015

fib_select_default considers alternative routes only when
res->fi is for the first alias in res->fa_head. In the
common case this can happen only when the initial lookup
matches the first alias with highest TOS value. This
prevents the alternative routes to require specific TOS.

This patch solves the problem as follows:

- routes that require specific TOS should be returned by
fib_select_default only when TOS matches, as already done
in fib_table_lookup. This rule implies that depending on the
TOS we can have many different lists of alternative gateways
and we have to keep the last used gateway (fa_default) in first
alias for the TOS instead of using single tb_default value.

- as the aliases are ordered by many keys (TOS desc,
fib_priority asc), we restrict the possible results to
routes with matching TOS and lowest metric (fib_priority)
and routes that match any TOS, again with lowest metric.

For example, packet with TOS 8 can not use gw3 (not lowest
metric), gw4 (different TOS) and gw6 (not lowest metric),
all other gateways can be used:

tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
tos 8 via gw2 metric 2
tos 8 via gw3 metric 3
tos 4 via gw4
tos 0 via gw5
tos 0 via gw6 metric 1
Reported-by: NHagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2392debc

ipv4: fib_select_default should match the prefix · 18a912e9

由 Julian Anastasov 提交于 7月 22, 2015

fib_trie starting from 4.1 can link fib aliases from
different prefixes in same list. Make sure the alternative
gateways are in same table and for same prefix (0) by
checking tb_id and fa_slen.

Fixes: 79e5ad2c ("fib_trie: Remove leaf_info")
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

18a912e9

22 7月, 2015 2 次提交

tcp: suppress a division by zero warning · 89e478a2

由 Eric Dumazet 提交于 7月 22, 2015

Andrew Morton reported following warning on one ARM build
with gcc-4.4 :

net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero

Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.

Remove the warning by using a temporary variable.

Fixes: 095dc8e0 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89e478a2

inet: frags: fix defragmented packet's IP header for af_packet · 0848f642

由 Edward Hyunkoo Jee 提交于 7月 21, 2015

When ip_frag_queue() computes positions, it assumes that the passed
sk_buff does not contain L2 headers.

However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
functions can be called on outgoing packets that contain L2 headers.

Also, IPv4 checksum is not corrected after reassembly.

Fixes: 7736d33f ("packet: Add pre-defragmentation support for ipv4 fanouts.")
Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0848f642

16 7月, 2015 2 次提交

ipv6: lock socket in ip6_datagram_connect() · 03645a11

由 Eric Dumazet 提交于 7月 14, 2015

ip6_datagram_connect() is doing a lot of socket changes without
socket being locked.

This looks wrong, at least for udp_lib_rehash() which could corrupt
lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03645a11

tcp: don't use F-RTO on non-recurring timeouts · f82b681a

由 Yuchung Cheng 提交于 7月 13, 2015

Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.

This exacerbates the recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.

Fix it by correcting the test of a new recovery episode.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f82b681a

11 7月, 2015 1 次提交

net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets · 8220ea23

由 Phil Sutter 提交于 7月 10, 2015

Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.

>From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch allows to answer this question reliably.
Signed-off-by: NPhil Sutter <phil@nwl.cc>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8220ea23

09 7月, 2015 2 次提交

ipv4: add support for linkdown sysctl to netconf · 974d7af5

由 Andy Gospodarek 提交于 7月 07, 2015

This kernel patch exports the value of the new
ignore_routes_with_linkdown via netconf.

v2: changes to notify userspace via netlink when sysctl values change
and proposed for 'net' since this could be considered a bugfix
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Suggested-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

974d7af5

ip_tunnel: fix ipv4 pmtu check to honor inner ip header df · fc24f2b2

由 Timo Teräs 提交于 7月 07, 2015

Frag needed should be sent only if the inner header asked
to not fragment. Currently fragmentation is broken if the
tunnel has df set, but df was not asked in the original
packet. The tunnel's df needs to be still checked to update
internally the pmtu cache.

Commit 23a3647b broke it, and this commit fixes
the ipv4 df check back to the way it was.

Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: NTimo Teräs <timo.teras@iki.fi>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fc24f2b2

02 7月, 2015 1 次提交

netfilter: arptables: use percpu jumpstack · 3bd22997

由 Florian Westphal 提交于 6月 30, 2015

commit 482cfc31 ("netfilter: xtables: avoid percpu ruleset duplication")

Unlike ip and ip6tables, arp tables were never converted to use the percpu
jump stack.

It still uses the rule blob to store return address, which isn't safe
anymore since we now share this blob among all processors.

Because there is no TEE support for arptables, we don't need to cope
with reentrancy, so we can use loocal variable to hold stack offset.

Fixes: 482cfc31 ("netfilter: xtables: avoid percpu ruleset duplication")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

3bd22997

29 6月, 2015 1 次提交

ipv4: fix RCU lockdep warning from linkdown changes · 96ac5cc9

由 Andy Gospodarek 提交于 6月 26, 2015

The following lockdep splat was seen due to the wrong context for
grabbing in_dev.

===============================
[ INFO: suspicious RCU usage. ]
4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 Not tainted
-------------------------------
include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ip/403:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81453305>] rtnl_lock+0x17/0x19
 #1:  ((inetaddr_chain).rwsem){.+.+.+}, at: [<ffffffff8105c327>] __blocking_notifier_call_chain+0x35/0x6a

stack backtrace:
CPU: 2 PID: 403 Comm: ip Not tainted 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244
 0000000000000001 ffff8800b189b728 ffffffff8150a542 ffffffff8107a8b3
 ffff880037bbea40 ffff8800b189b758 ffffffff8107cb74 ffff8800379dbd00
 ffff8800bec85800 ffff8800bf9e13c0 00000000000000ff ffff8800b189b7d8
Call Trace:
 [<ffffffff8150a542>] dump_stack+0x4c/0x6e
 [<ffffffff8107a8b3>] ? up+0x39/0x3e
 [<ffffffff8107cb74>] lockdep_rcu_suspicious+0xf7/0x100
 [<ffffffff814b63c3>] fib_dump_info+0x227/0x3e2
 [<ffffffff814b6624>] rtmsg_fib+0xa6/0x116
 [<ffffffff814b978f>] fib_table_insert+0x316/0x355
 [<ffffffff814b362e>] fib_magic+0xb7/0xc7
 [<ffffffff814b4803>] fib_add_ifaddr+0xb1/0x13b
 [<ffffffff814b4d09>] fib_inetaddr_event+0x36/0x90
 [<ffffffff8105c086>] notifier_call_chain+0x4c/0x71
 [<ffffffff8105c340>] __blocking_notifier_call_chain+0x4e/0x6a
 [<ffffffff8105c370>] blocking_notifier_call_chain+0x14/0x16
 [<ffffffff814a7f50>] __inet_insert_ifa+0x1a5/0x1b3
 [<ffffffff814a894d>] inet_rtm_newaddr+0x350/0x35f
 [<ffffffff81457866>] rtnetlink_rcv_msg+0x17b/0x18a
 [<ffffffff8107e7c3>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8146965f>] ? netlink_deliver_tap+0x1cb/0x1f7
 [<ffffffff814576eb>] ? rtnl_newlink+0x72a/0x72a
...

This patch resolves that splat.
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Reported-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

96ac5cc9

24 6月, 2015 4 次提交

net: inet_diag: export IPV6_V6ONLY sockopt · 20462155

由 Phil Sutter 提交于 6月 24, 2015

For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
exported to userspace. It indicates whether a socket bound to in6addr_any
listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
listed by e.g. 'ss -l -4'.

This patch is accompanied by an appropriate one for iproute2 to enable
the additional information in 'ss -e'.
Signed-off-by: NPhil Sutter <phil@nwl.cc>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20462155

net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f

由 Andy Gospodarek 提交于 6月 23, 2015

This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.

net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, will report to userspace that a route is
dead and will no longer resolve to this nexthop when performing a fib
lookup.  This will signal to userspace that the route will not be
selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
if the sysctl is enabled and link is down.  This was done as without it
the netlink listeners would have no idea whether or not a nexthop would
be selected.   The kernel only sets RTNH_F_DEAD internally if the
interface has IFF_UP cleared.

With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
    cache
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
    cache

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
    cache
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
80.0.0.2 dev p8p1  src 80.0.0.1
    cache

and the output changes to what one would expect.

If the sysctl is not set, the following output would be expected when
p8p1 is down:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2

Since the dead flag does not appear, there should be no expectation that
the kernel would skip using this route due to link being down.

v2: Split kernel changes into 2 patches, this actually makes a
behavioral change if the sysctl is set.  Also took suggestion from Alex
to simplify code by only checking sysctl during fib lookup and
suggestion from Scott to add a per-interface sysctl.

v3: Code clean-ups to make it more readable and efficient as well as a
reverse path check fix.

v4: Drop binary sysctl

v5: Whitespace fixups from Dave

v6: Style changes from Dave and checkpatch suggestions

v7: One more checkpatch fixup
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0eeb075f

net: track link-status of ipv4 nexthops · 8a3d0316

由 Andy Gospodarek 提交于 6月 23, 2015

Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off.  No action is taken,
but additional flags are passed to userspace to indicate carrier status.

This also includes a cleanup to fib_disable_ip to more clearly indicate
what event made the function call to replace the more cryptic force
option previously used.

v2: Split out kernel functionality into 2 patches, this patch simply
sets and clears new nexthop flag RTNH_F_LINKDOWN.

v3: Cleanups suggested by Alex as well as a bug noticed in
fib_sync_down_dev and fib_sync_up when multipath was not enabled.

v5: Whitespace and variable declaration fixups suggested by Dave.

v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
all.
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a3d0316

ip: report the original address of ICMP messages · 34b99df4

由 Julian Anastasov 提交于 6月 23, 2015

ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.

Fixes: c247f053 ("ip: fix error queue empty skb handling")
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34b99df4

23 6月, 2015 2 次提交

tcp: Do not call tcp_fastopen_reset_cipher from interrupt context · dfea2aa6

由 Christoph Paasch 提交于 6月 18, 2015

tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kind of stuff with
GFP_KERNEL.

Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

[   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[   36.012494] Call Trace:
[   36.012953]  <IRQ>  [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
[   36.014085]  [<ffffffff810967d3>] ___might_sleep+0x103/0x170
[   36.015117]  [<ffffffff81096892>] __might_sleep+0x52/0x90
[   36.016117]  [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
[   36.017266]  [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
[   36.018485]  [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
[   36.019679]  [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
[   36.020884]  [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
[   36.022058]  [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
[   36.023118]  [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
[   36.024185]  [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
[   36.025327]  [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
[   36.026410]  [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
[   36.027556]  [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
[   36.028784]  [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
[   36.029832]  [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
[   36.030936]  [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
[   36.031875]  [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
[   36.032953]  [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
[   36.034065]  [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
[   36.035069]  [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
[   36.035963]  [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
[   36.036950]  [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
[   36.037847]  [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
[   36.038994]  [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
[   36.040033]  [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
[   36.041025]  [<ffffffff81611482>] net_rx_action+0x122/0x310
[   36.042007]  [<ffffffff81076743>] __do_softirq+0x103/0x2f0
[   36.042978]  [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30

This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: 222e83d2 ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfea2aa6

inet_diag: Remove _bh suffix in inet_diag_dump_reqs(). · 3b188443

由 Hiroaki SHIMODA 提交于 6月 18, 2015

inet_diag_dump_reqs() is called from inet_diag_dump_icsk() with BH
disabled. So no need to disable BH in inet_diag_dump_reqs().
Signed-off-by: NHiroaki Shimoda <shimoda.hiroaki@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b188443

22 6月, 2015 2 次提交

ipv4: include NLM_F_APPEND flag in append route notifications · a2bb6d7d

由 Roopa Prabhu 提交于 6月 17, 2015

This patch adds NLM_F_APPEND flag to struct nlmsg_hdr->nlmsg_flags
in newroute notifications if the route add was an append.
(This is similar to how NLM_F_REPLACE is already part of new
route replace notifications today)

This helps userspace determine if the route add operation was
an append.
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a2bb6d7d

sock_diag: fetch source port from inet_sock · e0df02e0

由 Craig Gallek 提交于 6月 17, 2015

When an inet_sock is destroyed, its source port (sk_num) is set to
zero as part of the unhash procedure.  In order to supply a source
port as part of the NETLINK_SOCK_DIAG socket destruction broadcasts,
the source port number must be read from inet_sport instead.

Tested: ss -E
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e0df02e0

17 6月, 2015 1 次提交

netfilter: don't use module_init/exit in core IPV4 code · 55331060

由 Paul Gortmaker 提交于 5月 01, 2015

The file net/ipv4/netfilter.o is created based on whether
CONFIG_NETFILTER is set.  However that is defined as a bool, and
hence this file with the core netfilter hooks will never be
modular.  So using module_init as an alias for __initcall can be
somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future.  If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.  Also add an inclusion of init.h, as
that was previously implicit here in the netfilter.c file.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups.  As __initcall gets
mapped onto device_initcall, our use of subsys_initcall (which
seems to make sense for netfilter code) will thus change this
registration from level 6-device to level 4-subsys (i.e. slightly
earlier).  However no observable impact of that small difference
has been observed during testing, or is expected. (i.e. the
location of the netfilter messages in dmesg remains unchanged
with respect to all the other surrounding messages.)

As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netfilter-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>

55331060

16 6月, 2015 3 次提交

sock_diag: implement a get_info handler for inet · 35ac838a

由 Craig Gallek 提交于 6月 15, 2015

This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.

This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL).  This attribute is currently only used
for multicast messages.  Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast.  This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.

Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

35ac838a

sock_diag: specify info_size per inet protocol · 3fd22af8

由 Craig Gallek 提交于 6月 15, 2015

Previously, there was no clear distinction between the inet protocols
that used struct tcp_info to report information and those that didn't.
This change adds a specific size attribute to the inet_diag_handler
struct which defines these interfaces.  This will make dispatching
sock_diag get_info requests identical for all inet protocols in a
following patch.

Tested: ss -au
Tested: ss -at
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3fd22af8

netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference. · 711bdde6

由 Eric Dumazet 提交于 6月 15, 2015

After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore :
Only one copy of table is kept, instead of one copy per cpu.

We also can avoid a dereference if we put table data right after
xt_table_info. It reduces register pressure and helps compiler.

Then, we attempt a kmalloc() if total size is under order-3 allocation,
to reduce TLB pressure, as in many cases, rules fit in 32 KB.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

711bdde6

15 6月, 2015 2 次提交

netfilter: Kconfig: get rid of parens around depends on · f09becc7

由 Pablo Neira Ayuso 提交于 6月 12, 2015

According to the reporter, they are not needed.
Reported-by: NSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f09becc7

tcp: cdg: use div_u64() · 758f0d4b

由 Kenneth Klette Jonassen 提交于 6月 12, 2015

Fixes cross-compile to mips.
Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

758f0d4b

13 6月, 2015 1 次提交

net: ipv4: un-inline ip_finish_output2 · b60f2f3d

由 Florian Westphal 提交于 6月 12, 2015

text    data     bss     dec     hex filename
old: 16527      44       0   16571    40bb net/ipv4/ip_output.o
new: 14935      44       0   14979    3a83 net/ipv4/ip_output.o
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b60f2f3d

12 6月, 2015 8 次提交

netfilter: xtables: avoid percpu ruleset duplication · 482cfc31

由 Florian Westphal 提交于 6月 11, 2015

We store the rule blob per (possible) cpu.  Unfortunately this means we can
waste lot of memory on big smp machines. ipt_entry structure ('rule head')
is 112 byte, so e.g. with maxcpu=64 one single rule eats
close to 8k RAM.

Since previous patch made counters percpu it appears there is nothing
left in the rule blob that needs to be percpu.

On my test system (144 possible cpus, 400k dummy rules) this
change saves close to 9 Gigabyte of RAM.
Reported-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

482cfc31

netfilter: xtables: use percpu rule counters · 71ae0dff

由 Florian Westphal 提交于 6月 11, 2015

The binary arp/ip/ip6tables ruleset is stored per cpu.

The only reason left as to why we need percpu duplication are the rule
counters embedded into ipt_entry et al -- since each cpu has its own copy
of the rules, all counters can be lockless.

The downside is that the more cpus are supported, the more memory is
required.  Rules are not just duplicated per online cpu but for each
possible cpu, i.e. if maxcpu is 144, then rule is duplicated 144 times,
not for the e.g. 64 cores present.

To save some memory and also improve utilization of shared caches it
would be preferable to only store the rule blob once.

So we first need to separate counters and the rule blob.

Instead of using entry->counters, allocate this percpu and store the
percpu address in entry->counters.pcnt on CONFIG_SMP.

This change makes no sense as-is; it is merely an intermediate step to
remove the percpu duplication of the rule set in a followup patch.
Suggested-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Reported-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

71ae0dff

net: ip_fragment: remove BRIDGE_NETFILTER mtu special handling · 33b1f313

由 Florian Westphal 提交于 6月 05, 2015

since commit d6b915e2
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.

Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

33b1f313

tcp: remove obsolete check in tcp_set_skb_tso_segs() · b5e2c457