提交 · 52fe51f8523751da0e79c85350c47eb3bb94da5b · openeuler / Kernel

10 9月, 2015 3 次提交

ipv6: fix ifnullfree.cocci warnings · 52fe51f8

由 Wu Fengguang 提交于 9月 10, 2015

net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

NULL check before some freeing functions is not needed.

Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52fe51f8

net: ipv6: use common fib_default_rule_pref · f53de1e9

由 Phil Sutter 提交于 9月 09, 2015

This switches IPv6 policy routing to use the shared
fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
multicast routing for IPv4 as well as IPv6.

The motivation for this patch is a complaint about iproute2 behaving
inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
assigned priority value was decreased with each rule added.

Since then all users of the default_pref field have been converted to
assign the generic function fib_default_rule_pref(), fib_nl_newrule()
may just use it directly instead. Therefore get rid of the function
pointer altogether and make fib_default_rule_pref() static, as it's not
used outside fib_rules.c anymore.
Signed-off-by: NPhil Sutter <phil@nwl.cc>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f53de1e9

ipv6: fix multipath route replace error recovery · 6b9ea5a6

由 Roopa Prabhu 提交于 9月 08, 2015

Problem:
The ecmp route replace support for ipv6 in the kernel, deletes the
existing ecmp route too early, ie when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, its too late
to recover the already deleted existing route leaving the fib
in an inconsistent state.

This patch reduces the possibility of this by doing the following:
a) Changes the existing multipath route add code to a two stage process:
  build rt6_infos + insert them
	ip6_route_add rt6_info creation code is moved into
	ip6_route_info_create.
b) This ensures that most errors are caught during building rt6_infos
  and we fail early
c) Separates multipath add and del code. Because add needs the special
  two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
  warning is printed (This should be unlikely)

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 dissappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

Fixes: 27596472 ("ipv6: fix ECMP route replacement")
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b9ea5a6

06 9月, 2015 1 次提交

net/ipv6: Correct PIM6 mrt_lock handling · 25b4a44c

由 Richard Laing 提交于 9月 03, 2015

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.
Signed-off-by: NRichard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25b4a44c

03 9月, 2015 2 次提交

netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled · a82b0e63

由 Daniel Borkmann 提交于 9月 02, 2015

While testing various Kconfig options on another issue, I found that
the following one triggers as well on allmodconfig and nf_conntrack
disabled:

  net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’:
  net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
    if (this_cpu_read(nf_skb_duplicated))
  [...]
  net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’:
  net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
    if (this_cpu_read(nf_skb_duplicated))

Fix it by including directly the header where it is defined.

Fixes: bbde9fc1 ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a82b0e63

ipv6: fix exthdrs offload registration in out_rt path · e41b0bed

由 Daniel Borkmann 提交于 9月 03, 2015

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e41b0bed

01 9月, 2015 5 次提交

M
ipv6: send only one NEWLINK when RA causes changes · 2053aeb6
由 Marius Tomaschewski 提交于 9月 01, 2015
```
Signed-off-by: NMarius Tomaschewski <mt@suse.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
2053aeb6

ipv6: send NEWLINK on RA managed/otherconf changes · a394eef5

由 Marius Tomaschewski 提交于 8月 31, 2015

The kernel is applying the RA managed/otherconf flags silently and
forgets to send ifinfo notify to inform about their change when the
router provides a zero reachable_time and retrans_timer as dnsmasq
and many routers send it, which just means unspecified by this router
and the host should continue using whatever value it is already using.
Userspace may monitor the ifinfo notifications to activate dhcpv6.
Signed-off-by: NMarius Tomaschewski <mt@suse.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a394eef5

tcp: use dctcp if enabled on the route to the initiator · c3a8d947

由 Daniel Borkmann 提交于 8月 31, 2015

Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c3a8d947

fib, fib6: reject invalid feature bits · b8d3e416

由 Daniel Borkmann 提交于 8月 31, 2015

Feature bits that are invalid should not be accepted by the kernel,
only the lower 4 bits may be configured, but not the remaining ones.
Even from these 4, 2 of them are unused.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8d3e416

net: fib6: reduce identation in ip6_convert_metrics · 1bb14807

由 Daniel Borkmann 提交于 8月 31, 2015

Reduce the identation a bit, there's no need to artificically have
it increased.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1bb14807

31 8月, 2015 1 次提交

net: Optimize snmp stat aggregation by walking all the percpu data at once · a3a77372

由 Raghavendra K T 提交于 8月 30, 2015

Docker container creation linearly increased from around 1.6 sec to 7.5 sec
(at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field.

reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
through per cpu data of an item (iteratively for around 36 items).

idea: This patch tries to aggregate the statistics by going through
all the items of each cpu sequentially which is reducing cache
misses.

Docker creation got faster by more than 2x after the patch.

Result:
                       Before           After
Docker creation time   6.836s           3.25s
cache miss             2.7%             1.41%

perf before:
    50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
     9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
     3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
     2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock

perf after:
    10.57%  docker           docker                [.] scanblock
     8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
     6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
     6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one

changes/ideas suggested:
Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
place of unaligned_put (Joe).
Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3a77372

30 8月, 2015 2 次提交

vxlan: do not receive IPv4 packets on IPv6 socket · a43a9ef6

由 Jiri Benc 提交于 8月 28, 2015

By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.

In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.

Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a43a9ef6

ip_tunnels: convert the mode field of ip_tunnel_info to flags · 46fa062a

由 Jiri Benc 提交于 8月 28, 2015

The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46fa062a

29 8月, 2015 2 次提交

netfilter: reduce sparse warnings · 851345c5

由 Florian Westphal 提交于 8月 28, 2015

bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers)
-> remove __pure annotation.

ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16
-> switch ntohs to htons and vice versa.

netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static?
-> delete it, got removed

net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32
-> Use __be32 instead of u32.

Tested with objdiff that these changes do not affect generated code.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

851345c5

Revert "netfilter: xtables: compute exact size needed for jumpstack" · 98dbbfc3

由 Florian Westphal 提交于 8月 26, 2015

This reverts commit 98d1bd80.

mark_source_chains will not re-visit chains, so

*filter
:INPUT ACCEPT [365:25776]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [217:45832]
:t1 - [0:0]
:t2 - [0:0]
:t3 - [0:0]
:t4 - [0:0]
-A t1 -i lo -j t2
-A t2 -i lo -j t3
-A t3 -i lo -j t4
# -A INPUT -j t4
# -A INPUT -j t3
# -A INPUT -j t2
-A INPUT -j t1
COMMIT

Will compute a chain depth of 2 if the comments are removed.
Revert back to counting the number of chains for the time being.
Reported-by: NCong Wang <cwang@twopensource.com>
Reported-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

98dbbfc3

28 8月, 2015 1 次提交

ipv6: Export nf_ct_frag6_gather() · 5b490047

由 Joe Stringer 提交于 8月 26, 2015

Signed-off-by: NJoe Stringer <joestringer@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b490047

27 8月, 2015 1 次提交

netfilter: ip6t_REJECT: added missing icmpv6 codes · 1afe839e

由 Andreas Herz 提交于 8月 21, 2015

RFC 4443 added two new codes values for ICMPv6 type 1:

 5 - Source address failed ingress/egress policy
 6 - Reject route to destination

And RFC 7084 states in L-14 that IPv6 Router MUST send ICMPv6 Destination
Unreachable with code 5 for packets forwarded to it that use an address
from a prefix that has been invalidated.

Codes 5 and 6 are more informative subsets of code 1.
Signed-off-by: NAndreas Herz <andi@geekosphere.org>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

1afe839e

26 8月, 2015 3 次提交

ip6_gre: release cached dst on tunnel removal · d4257295

由 huaibin Wang 提交于 8月 25, 2015

When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a ("gre: Support GRE over IPv6")
Signed-off-by: Nhuaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4257295

ah6: fix error return code · 25105051

由 Julia Lawall 提交于 8月 23, 2015

Return a negative error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>
Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25105051

xfrm: Use VRF master index if output device is enslaved · 4ec3b28c

由 David Ahern 提交于 8月 20, 2015

Directs route lookups to VRF table. Compiles out if NET_VRF is not
enabled. With this patch able to successfully bring up ipsec tunnels
in VRFs, even with duplicate network configuration.
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4ec3b28c

25 8月, 2015 2 次提交

ila: Precompute checksum difference for translations · 92b78aff

由 Tom Herbert 提交于 8月 24, 2015

In the ILA build state for LWT compute the checksum difference to apply
to transport checksums that include the IPv6 pseudo header. The
difference is between the route destination (from fib6_config) and the
locator to write.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

92b78aff

lwt: Add cfg argument to build_state · 127eb7cd

由 Tom Herbert 提交于 8月 24, 2015

Add cfg and family arguments to lwt build state functions. cfg is a void
pointer and will either be a pointer to a fib_config or fib6_config
structure. The family parameter indicates which one (either AF_INET
or AF_INET6).

LWT encpasulation implementation may use the fib configuration to build
the LWT state.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

127eb7cd

22 8月, 2015 1 次提交

netfilter: nf_dup: fix sparse warnings · 59e26423

由 Pablo Neira Ayuso 提交于 8月 21, 2015

>> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types)
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    expected restricted __be32 [user type] s_addr
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    got unsigned int [unsigned] <noident>

>> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types)
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    expected restricted __be32 [addressable] [assigned] [usertype] flowlabel
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    got int
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

59e26423

21 8月, 2015 4 次提交

ipv6: route: extend flow representation with tunnel key · 904af04d

由 Jiri Benc 提交于 8月 20, 2015

Use flowi_tunnel in flowi6 similarly to what is done with IPv4.
This complements commit 1b7179d3 ("route: Extend flow representation
with tunnel key").
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

904af04d

ipv6: ndisc: inherit metadata dst when creating ndisc requests · ab450605

由 Jiri Benc 提交于 8月 20, 2015

If output device wants to see the dst, inherit the dst of the original skb
in the ndisc request.

This is an IPv6 counterpart of commit 0accfc26 ("arp: Inherit metadata
dst when creating ARP requests").
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab450605

ipv6: drop metadata dst in ip6_route_input · 06e9d040

由 Jiri Benc 提交于 8月 20, 2015

The fix in commit 48fb6b55 is incomplete, as now ip6_route_input can be
called with non-NULL dst if it's a metadata dst and the reference is leaked.
Drop the reference.

Fixes: 48fb6b55 ("ipv6: fix crash over flow-based vxlan device")
Fixes: ee122c79 ("vxlan: Flow based tunneling")
CC: Wei-Chun Chao <weichunc@plumgrid.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06e9d040

route: move lwtunnel state to dst_entry · 61adedf3

由 Jiri Benc 提交于 8月 20, 2015

Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.

As a bonus, this brings a nice diffstat.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61adedf3

18 8月, 2015 9 次提交

net: Identifier Locator Addressing module · 65d7ab8d

由 Tom Herbert 提交于 8月 17, 2015

Adding new module name ila. This implements ILA translation. Light
weight tunnel redirection is used to perform the translation in
the data path. This is configured by the "ip -6 route" command
using the "encap ila <locator>" option, where <locator> is the
value to set in destination locator of the packet. e.g.

ip -6 route add 3333:0:0:1:5555:0:1:0/128 \
      encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0

Sets a route where 3333:0:0:1 will be overwritten by
2001:0:0:1 on output.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65d7ab8d

net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool · 4b048d6d

由 Tom Herbert 提交于 8月 17, 2015

inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
the checksum field carries a pseudo header. This argument should be a
boolean instead of an int.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b048d6d

lwt: Add support to redirect dst.input · 25368623

由 Tom Herbert 提交于 8月 17, 2015

This patch adds the capability to redirect dst input in the same way
that dst output is redirected by LWT.

Also, save the original dst.input and and dst.out when setting up
lwtunnel redirection. These can be called by the client as a pass-
through.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25368623

netfilter: nf_conntrack: add efficient mark to zone mapping · 5e8018fc

由 Daniel Borkmann 提交于 8月 14, 2015

This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.

Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.

In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

5e8018fc

netfilter: nf_conntrack: add direction support for zones · deedb590

由 Daniel Borkmann 提交于 8月 14, 2015

This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.

The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.

As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.

Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.

If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).

Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.

Below a minimal, simplified collision example (script in [2]) with
netperf sessions:

  +--- tenant-1 ---+   mark := 1
  |    netperf     |--+
  +----------------+  |                CT zone := mark [ORIGINAL]
   [ip,sport] := X   +--------------+  +--- gateway ---+
                     | mark routing |--|     SNAT      |-- ... +
                     +--------------+  +---------------+       |
  +--- tenant-2 ---+  |                                     ~~~|~~~
  |    netperf     |--+                +-----------+           |
  +----------------+   mark := 2       | netserver |------ ... +
   [ip,sport] := X                     +-----------+
                                        [ip,port] := Y
On the gateway netns, example:

  iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
  iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully

  iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
  iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

conntrack dump from gateway netns:

  netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns

  tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                        src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                        src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.

I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.

  [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
  [2] https://paste.fedoraproject.org/242835/65657871/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

deedb590

ipv6: trivial whitespace fix · ec120da6

由 Ian Morris 提交于 8月 14, 2015

Change brace placement to be in line with coding standards
Signed-off-by: NIan Morris <ipm@chirality.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ec120da6

ipv6: Fix a potential deadlock when creating pcpu rt · 9c7370a1

由 Martin KaFai Lau 提交于 8月 14, 2015

rt6_make_pcpu_route() is called under read_lock(&table->tb6_lock).
rt6_make_pcpu_route() calls ip6_rt_pcpu_alloc(rt) which then
calls dst_alloc().  dst_alloc() _may_ call ip6_dst_gc() which takes
the write_lock(&tabl->tb6_lock).  A visualized version:

read_lock(&table->tb6_lock);
rt6_make_pcpu_route();
=> ip6_rt_pcpu_alloc();
=> dst_alloc();
=> ip6_dst_gc();
=> write_lock(&table->tb6_lock); /* oops */

The fix is to do a read_unlock first before calling ip6_rt_pcpu_alloc().

A reported stack:

[141625.537638] INFO: rcu_sched self-detected stall on CPU { 27}  (t=60000 jiffies g=4159086 c=4159085 q=2139)
[141625.547469] Task dump for CPU 27:
[141625.550881] mtr             R  running task        0 22121  22081 0x00000008
[141625.558069]  0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b
[141625.565641]  ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000
[141625.573220]  ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00
[141625.580803] Call Trace:
[141625.583345]  <IRQ>  [<ffffffff8106e488>] sched_show_task+0xc1/0xc6
[141625.589650]  [<ffffffff810702b0>] dump_cpu_task+0x35/0x39
[141625.595144]  [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c
[141625.601320]  [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4
[141625.607669]  [<ffffffff810940c8>] update_process_times+0x2a/0x4f
[141625.613925]  [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e
[141625.619923]  [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c
[141625.625830]  [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d
[141625.632171]  [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166
[141625.638258]  [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52
[141625.645036]  [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a
[141625.651643]  [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70
[141625.657895]  <EOI>  [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5
[141625.664188]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
[141625.670272]  [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f
[141625.677140]  [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25
[141625.683218]  [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82
[141625.689124]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
[141625.695207]  [<ffffffff813d6058>] fib6_clean_all+0xe/0x10
[141625.700854]  [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8
[141625.706329]  [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9
[141625.711718]  [<ffffffff81346d68>] dst_alloc+0x55/0x159
[141625.717105]  [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63
[141625.723620]  [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8
[141625.729441]  [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13
[141625.735700]  [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf
[141625.741698]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
[141625.748043]  [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a
[141625.754050]  [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9
[141625.761002]  [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c
[141625.766914]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
[141625.773260]  [<ffffffff813d008c>] ip6_route_output+0x7a/0x82
[141625.779177]  [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112
[141625.785437]  [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b
[141625.791604]  [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6
[141625.797423]  [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2
[141625.804464]  [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e
[141625.810028]  [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c
[141625.815588]  [<ffffffff8132be57>] SyS_sendto+0xfe/0x143
[141625.821063]  [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67
[141625.827146]  [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11
[141625.833660]  [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2
[141625.839565]  [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a

Fixes: d52d3997 ("pv6: Create percpu rt6_info")
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: NSteinar H. Gunderson <sgunderson@bigfoot.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c7370a1

ipv6: Add rt6_make_pcpu_route() · a73e4195

由 Martin KaFai Lau 提交于 8月 14, 2015

It is a prep work for fixing a potential deadlock when creating
a pcpu rt.

The current rt6_get_pcpu_route() will also create a pcpu rt if one does not
exist.  This patch moves the pcpu rt creation logic into another function,
rt6_make_pcpu_route().
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a73e4195

ipv6: Remove un-used argument from ip6_dst_alloc() · ad706862

由 Martin KaFai Lau 提交于 8月 14, 2015

After 4b32b5ad ("ipv6: Stop rt6_info from using inet_peer's metrics"),
ip6_dst_alloc() does not need the 'table' argument.  This patch
cleans it up.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad706862

14 8月, 2015 3 次提交

net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfo · 0344338b

由 Andy Gospodarek 提交于 8月 13, 2015

This is useful information to include in ipv6 netlink messages that
report interface information. IFLA_OPERSTATE is already included in
ipv4 messages, but missing for ipv6. This closes that gap.
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0344338b

net: ipv6 sysctl option to ignore routes when nexthop link is down · 35103d11

由 Andy Gospodarek 提交于 8月 13, 2015

Like the ipv4 patch with a similar title, this adds a sysctl to allow
the user to change routing behavior based on whether or not the
interface associated with the nexthop was an up or down link.  The
default setting preserves the current behavior, but anyone that enables
it will notice that nexthops on down interfaces will no longer be
selected:

net.ipv6.conf.all.ignore_routes_with_linkdown = 0
net.ipv6.conf.default.ignore_routes_with_linkdown = 0
net.ipv6.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, not only will link status be reported to
userspace, but an indication that a nexthop is dead and will not be used
is also reported.

1000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
1000::/8 via 8000::2 dev p8p1  metric 1024  pref medium
7000::/8 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
8000::/8 dev p8p1  proto kernel  metric 256  pref medium
9000::/8 via 8000::2 dev p8p1  metric 2048  pref medium
9000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
fe80::/64 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
fe80::/64 dev p8p1  proto kernel  metric 256  pref medium

This also adds devconf support and notification when sysctl values
change.

v2: drop use of rt6i_nhflags since it is not needed right now
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

35103d11

net: track link status of ipv6 nexthops · cea45e20

由 Andy Gospodarek 提交于 8月 13, 2015

Add support to track current link status of ipv6 nexthops to match
recent changes that added support for ipv4 nexthops.  This takes a
simple approach to track linkdown status for next-hops and simply
checks the dev for the dst entry and sets proper flags that to be used
in the netlink message.

v2: drop use of rt6i_nhflags since it is not needed right now
Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cea45e20

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功