提交 · c4c877b2732466b4c63217baad05c96f775912c7 · openeuler / Kernel

11 3月, 2021 1 次提交

net: Consolidate common blackhole dst ops · c4c877b2

由 Daniel Borkmann 提交于 3月 10, 2021

Move generic blackhole dst ops to the core and use them from both
ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
functional change otherwise. We need these also in other locations
and having to define them over and over again is not great.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c4c877b2

09 2月, 2021 2 次提交

IPv6: Extend 'fib_notify_on_flag_change' sysctl · 6fad361a

由 Amit Cohen 提交于 2月 07, 2021

Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.

Separate value is added for such notifications because there are less of
them, so they do not impact performance and some users will find them more
important.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6fad361a

IPv6: Add "offload failed" indication to routes · 0c5fcf9e

由 Amit Cohen 提交于 2月 07, 2021

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

To avoid such cases, previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed, this behavior is controlled by sysctl.

With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.

This patch adds an "offload_failed" indication to IPv6 routes, so that
users will have better visibility into the offload process.

'struct fib6_info' is extended with new field that indicates if route
offload failed. Note that the new field is added using unused bit and
therefore there is no need to increase struct size.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c5fcf9e

05 2月, 2021 1 次提交

net: fix building errors on powerpc when CONFIG_RETPOLINE is not set · 9c97921a

由 Brian Vazquez 提交于 2月 04, 2021

This commit fixes the errores reported when building for powerpc:

ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL

Fixes: f67fbeae ("net: use indirect call helpers for dst_mtu")
Fixes: bbd807df ("net: indirect call helpers for ipv4/ipv6 dst_check functions")
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NBrian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/r/20210204181839.558951-2-brianvv@google.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

9c97921a

04 2月, 2021 2 次提交

net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df

由 Brian Vazquez 提交于 2月 01, 2021

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
Signed-off-by: NBrian Vazquez <brianvv@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

bbd807df

net: use indirect call helpers for dst_mtu · f67fbeae

由 Brian Vazquez 提交于 2月 01, 2021

This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu
Signed-off-by: NBrian Vazquez <brianvv@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

f67fbeae

03 2月, 2021 1 次提交

net: ipv6: Emit notification when fib hardware flags are changed · 907eea48

由 Amit Cohen 提交于 2月 01, 2021

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead
to a routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
  routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Move fib6_info_hw_flags_set() to C file because it is no longer a short
function.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

907eea48

27 1月, 2021 1 次提交

net: allow user to set metric on default route learned via Router Advertisement · 6b2e04bc

由 Praveen Chaudhary 提交于 1月 25, 2021

For IPv4, default route is learned via DHCPv4 and user is allowed to change
metric using config etc/network/interfaces. But for IPv6, default route can
be learned via RA, for which, currently a fixed metric value 1024 is used.

Ideally, user should be able to configure metric on default route for IPv6
similar to IPv4. This patch adds sysctl for the same.

Logs:

For IPv4:

Config in etc/network/interfaces:
auto eth0
iface eth0 inet dhcp
    metric 4261413864

IPv4 Kernel Route Table:
$ ip route list
default via 172.21.47.1 dev eth0 metric 4261413864

FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.]
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       > - selected route, * - FIB route

S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m

i.e. User can prefer Default Router learned via Routing Protocol in IPv4.
Similar behavior is not possible for IPv6, without this fix.

After fix [for IPv6]:
sudo sysctl -w net.ipv6.conf.eth0.net.ipv6.conf.eth0.ra_defrtr_metric=1996489705

IP monitor: [When IPv6 RA is received]
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high

Kernel IPv6 routing table
$ ip -6 route list
default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high

FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.]
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       > - selected route, * - FIB route

S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m

If the metric is changed later, the effect will be seen only when next IPv6
RA is received, because the default route must be fully controlled by RA msg.
Below metric is changed from 1996489705 to 1996489704.

$ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704

IP monitor:
[On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]

Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
Signed-off-by: NPraveen Chaudhary <pchaudhary@linkedin.com>
Signed-off-by: NZhenggen Xu <zxu@linkedin.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6b2e04bc

20 11月, 2020 1 次提交

IPv6: RTM_GETROUTE: Add RTA_ENCAP to result · 6b13d8f7

由 Oliver Herms 提交于 11月 19, 2020

This patch adds an IPv6 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(i.e. ip route get 2001:db8::).
Signed-off-by: NOliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201118230651.GA8861@twsSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6b13d8f7

07 11月, 2020 1 次提交

nexthop: Remove in-kernel route notifications when nexthop changes · bbea126c

由 Ido Schimmel 提交于 11月 04, 2020

Remove in-kernel route notifications when the configuration of their
nexthop changes.

These notifications are unnecessary because the route still uses the
same nexthop ID. A separate notification for the nexthop change itself
is now sent in the nexthop notification chain.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

bbea126c

10 10月, 2020 1 次提交

net: ipv6: Discard next-hop MTU less than minimum link MTU · 4a65dff8

由 Georg Kohmann 提交于 10月 07, 2020

When a ICMPV6_PKT_TOOBIG report a next-hop MTU that is less than the IPv6
minimum link MTU, the estimated path MTU is reduced to the minimum link
MTU. This behaviour breaks TAHI IPv6 Core Conformance Test v6LC4.1.6:
Packet Too Big Less than IPv6 MTU.

Referring to RFC 8201 section 4: "If a node receives a Packet Too Big
message reporting a next-hop MTU that is less than the IPv6 minimum link
MTU, it must discard it. A node must not reduce its estimate of the Path
MTU below the IPv6 minimum link MTU on receipt of a Packet Too Big
message."

Drop the path MTU update if reported MTU is less than the minimum link MTU.
Signed-off-by: NGeorg Kohmann <geokohma@cisco.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

4a65dff8

22 9月, 2020 1 次提交

ipv6: route: convert comma to semicolon · 91b2c9a0

由 Xu Wang 提交于 9月 21, 2020

Replace a comma between expression statements by a semicolon.
Signed-off-by: NXu Wang <vulab@iscas.ac.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91b2c9a0

12 9月, 2020 1 次提交

ipv6: remove redundant assignment to variable err · 2291267e

由 Colin Ian King 提交于 9月 11, 2020

The variable err is being initialized with a value that is never read and
it is being updated later with a new value. The initialization is redundant
and can be removed.  Also re-order variable declarations in reverse
Christmas tree ordering.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2291267e

29 7月, 2020 1 次提交

ipv6: Fix nexthop refcnt leak when creating ipv6 route info · 706ec919

由 Xiyu Yang 提交于 7月 25, 2020

ip6_route_info_create() invokes nexthop_get(), which increases the
refcount of the "nh".

When ip6_route_info_create() returns, local variable "nh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
ip6_route_info_create(). When nexthops can not be used with source
routing, the function forgets to decrease the refcnt increased by
nexthop_get(), causing a refcnt leak.

Fix this issue by pulling up the error source routing handling when
nexthops can not be used with source routing.

Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: NXiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

706ec919

26 7月, 2020 1 次提交

bpf: Refactor bpf_iter_reg to have separate seq_info member · 14fc6bd6

由 Yonghong Song 提交于 7月 23, 2020

There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for both prog_load, link_create
and seq_file creation.

This patch puts fields related seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com

14fc6bd6

22 7月, 2020 1 次提交

bpf: net: Use precomputed btf_id for bpf iterators · 951cf368

由 Yonghong Song 提交于 7月 20, 2020

One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com

951cf368

08 7月, 2020 1 次提交

ipv6: Fix use of anycast address with loopback · aea23c32

由 David Ahern 提交于 7月 07, 2020

Thomas reported a regression with IPv6 and anycast using the following
reproducer:

    echo 1 >  /proc/sys/net/ipv6/conf/all/forwarding
    ip -6 a add fc12::1/16 dev lo
    sleep 2
    echo "pinging lo"
    ping6 -c 2 fc12::

The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed
the use of fib6_is_reject which checks addresses added to the loopback
interface and sets the REJECT flag as needed. Update fib6_is_reject for
loopback checks to handle RTF_ANYCAST addresses.

Fixes: c7a1ce39 ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
Reported-by: thomas.gambier@nexedi.com
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aea23c32

07 7月, 2020 1 次提交

ipv6: fib6_select_path can not use out path for nexthop objects · 34fe5a1c

由 David Ahern 提交于 7月 06, 2020

Brian reported a crash in IPv6 code when using rpfilter with a setup
running FRR and external nexthop objects. The root cause of the crash
is fib6_select_path setting fib6_nh in the result to NULL because of
an improper check for nexthop objects.

More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif
set causing fib6_select_path to be called with have_oif_match set.
fib6_select_path has early check on have_oif_match and jumps to the
out label which presumes a builtin fib6_nh. This path is invalid for
nexthop objects; for external nexthops fib6_select_path needs to just
return if the fib6_nh has already been set in the result otherwise it
returns after the call to nexthop_path_fib6_result. Update the check
on have_oif_match to not bail on external nexthops.

Update selftests for this problem.

Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
Reported-by: NBrian Rak <brak@choopa.com>
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34fe5a1c

24 6月, 2020 1 次提交

ipv6: fib6: avoid indirect calls from fib6_rule_lookup · 55cced4f

由 Brian Vazquez 提交于 6月 23, 2020

It was reported that a considerable amount of cycles were spent on the
expensive indirect calls on fib6_rule_lookup. This patch introduces an
inline helper called pol_route_func that uses the indirect_call_wrappers
to avoid the indirect calls.

This patch saves around 50ns per call.

Performance was measured on the receiver by checking the amount of
syncookies that server was able to generate under a synflood load.

Traffic was generated using trafgen[1] which was pushing around 1Mpps on
a single queue. Receiver was using only one rx queue which help to
create a bottle neck and make the experiment rx-bounded.

These are the syncookies generated over 10s from the different runs:

Whithout the patch:
TcpExtSyncookiesSent            3553749            0.0
TcpExtSyncookiesSent            3550895            0.0
TcpExtSyncookiesSent            3553845            0.0
TcpExtSyncookiesSent            3541050            0.0
TcpExtSyncookiesSent            3539921            0.0
TcpExtSyncookiesSent            3557659            0.0
TcpExtSyncookiesSent            3526812            0.0
TcpExtSyncookiesSent            3536121            0.0
TcpExtSyncookiesSent            3529963            0.0
TcpExtSyncookiesSent            3536319            0.0

With the patch:
TcpExtSyncookiesSent            3611786            0.0
TcpExtSyncookiesSent            3596682            0.0
TcpExtSyncookiesSent            3606878            0.0
TcpExtSyncookiesSent            3599564            0.0
TcpExtSyncookiesSent            3601304            0.0
TcpExtSyncookiesSent            3609249            0.0
TcpExtSyncookiesSent            3617437            0.0
TcpExtSyncookiesSent            3608765            0.0
TcpExtSyncookiesSent            3620205            0.0
TcpExtSyncookiesSent            3601895            0.0

Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with
the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an
estimate of 50 ns per packet.

[1] http://netsniff-ng.org/

Changelog since v1:
 - Change ordering in the ICW (Paolo Abeni)

Cc: Luigi Rizzo <lrizzo@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NBrian Vazquez <brianvv@google.com>
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55cced4f

23 5月, 2020 1 次提交

nexthop: support for fdb ecmp nexthops · 38428d68

由 Roopa Prabhu 提交于 5月 21, 2020

This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).

Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.

example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb

$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self

[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: Nkernel test robot <rong.a.chen@intel.com>
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

38428d68

19 5月, 2020 1 次提交

ipv6: lift copy_from_user out of ipv6_route_ioctl · 7c1552da

由 Christoph Hellwig 提交于 5月 18, 2020

Prepare for better compat ioctl handling by moving the user copy out
of ipv6_route_ioctl.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7c1552da

14 5月, 2020 3 次提交

bpf: Enable bpf_iter targets registering ctx argument types · 3c32cc1b

由 Yonghong Song 提交于 5月 13, 2020

Commit b121b341 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com

3c32cc1b

bpf: Change func bpf_iter_unreg_target() signature · ab2ee4fc

由 Yonghong Song 提交于 5月 13, 2020

Change func bpf_iter_unreg_target() parameter from target
name to target reg_info, similar to bpf_iter_reg_target().
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com

ab2ee4fc

bpf: net: Refactor bpf_iter target registration · 15172a46

由 Yonghong Song 提交于 5月 13, 2020

Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, esp. in the future we may grow information
passed from targets to bpf_iter manager.

The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com

15172a46

10 5月, 2020 1 次提交

net: bpf: Add netlink and ipv6_route bpf_iter targets · 138d0be3

由 Yonghong Song 提交于 5月 09, 2020

This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.

The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.

Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com

138d0be3

09 5月, 2020 2 次提交

ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc() · d8882935

由 Eric Dumazet 提交于 5月 08, 2020

We currently have to adjust ipv6 route gc_thresh/max_size depending
on number of cpus on a server, this makes very little sense.

If the kernels sets /proc/sys/net/ipv6/route/gc_thresh to 1024
and /proc/sys/net/ipv6/route/max_size to 4096, then we better
not track the percpu dst that our implementation uses.

Only routes not added (directly or indirectly) by the admin
should be tracked and limited.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: NWei Wang <weiwan@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

d8882935

net/dst: use a smaller percpu_counter batch for dst entries accounting · cf86a086

由 Eric Dumazet 提交于 5月 07, 2020

percpu_counter_add() uses a default batch size which is quite big
on platforms with 256 cpus. (2*256 -> 512)

This means dst_entries_get_fast() can be off by +/- 2*(nr_cpus^2)
(131072 on servers with 256 cpus)

Reduce the batch size to something more reasonable, and
add logic to ip6_dst_gc() to call dst_entries_get_slow()
before calling the _very_ expensive fib6_run_gc() function.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

cf86a086

08 5月, 2020 1 次提交

Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu" · 09454fd0

由 Maciej Żenczykowski 提交于 5月 05, 2020

This reverts commit 19bda36c:

| ipv6: add mtu lock check in __ip6_rt_update_pmtu
|
| Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
| It leaded to that mtu lock doesn't really work when receiving the pkt
| of ICMPV6_PKT_TOOBIG.
|
| This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
| did in __ip_rt_update_pmtu.

The above reasoning is incorrect.  IPv6 *requires* icmp based pmtu to work.
There's already a comment to this effect elsewhere in the kernel:

  $ git grep -p -B1 -A3 'RTAX_MTU lock'
  net/ipv6/route.c=4813=

  static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
  ...
    /* In IPv6 pmtu discovery is not optional,
       so that RTAX_MTU lock cannot disable it.
       We still use this lock to block changes
       caused by addrconf/ndisc.
    */

This reverts to the pre-4.9 behaviour.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Fixes: 19bda36c ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

09454fd0

02 5月, 2020 1 次提交

ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b

由 David Ahern 提交于 5月 01, 2020

Nik reported a bug with pcpu dst cache when nexthop objects are
used illustrated by the following:
    $ ip netns add foo
    $ ip -netns foo li set lo up
    $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
    $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
    $ ip li add veth1 type veth peer name veth2
    $ ip li set veth1 up
    $ ip addr add 2001:db8:10::1/64 dev veth1
    $ ip li set dev veth2 netns foo
    $ ip -netns foo li set veth2 up
    $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
    $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Create a pcpu entry on cpu 0:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1

    Re-add the route entry:
    $ ip -6 ro del 2001:db8:11::1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Route get on cpu 0 returns the stale pcpu:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
    RTNETLINK answers: Network is unreachable

    While cpu 1 works:
    $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
    2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium

Conversion of FIB entries to work with external nexthop objects
missed an important difference between IPv4 and IPv6 - how dst
entries are invalidated when the FIB changes. IPv4 has a per-network
namespace generation id (rt_genid) that is bumped on changes to the FIB.
Checking if a dst_entry is still valid means comparing rt_genid in the
rtable to the current value of rt_genid for the namespace.

IPv6 also has a per network namespace counter, fib6_sernum, but the
count is saved per fib6_node. With the per-node counter only dst_entries
based on fib entries under the node are invalidated when changes are
made to the routes - limiting the scope of invalidations. IPv6 uses a
reference in the rt6_info, 'from', to track the corresponding fib entry
used to create the dst_entry. When validating a dst_entry, the 'from'
is used to backtrack to the fib6_node and check the sernum of it to the
cookie passed to the dst_check operation.

With the inline format (nexthop definition inline with the fib6_info),
dst_entries cached in the fib6_nh have a 1:1 correlation between fib
entries, nexthop data and dst_entries. With external nexthops, IPv6
looks more like IPv4 which means multiple fib entries across disparate
fib6_nodes can all reference the same fib6_nh. That means validation
of dst_entries based on external nexthops needs to use the IPv4 format
- the per-network namespace counter.

Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
rt6_get_cookie to return sernum if it is set and update dst_check for
IPv6 to look for sernum set and based the check on it if so. Finally,
rt6_get_pcpu_route needs to validate the cached entry before returning
a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
__mkroute_output for IPv4).

This problem only affects routes using the new, external nexthops.

Thanks to the kbuild test robot for catching the IS_ENABLED needed
around rt_genid_ipv6 before I sent this out.

Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
Reported-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8f34e53b

29 4月, 2020 2 次提交

net: ipv4: add sysctl for nexthop api compatibility mode · 4f80116d

由 Roopa Prabhu 提交于 4月 27, 2020

Current route nexthop API maintains user space compatibility
with old route API by default. Dumps and netlink notifications
support both new and old API format. In systems which have
moved to the new API, this compatibility mode cancels some
of the performance benefits provided by the new nexthop API.

This patch adds new sysctl nexthop_compat_mode which is on
by default but provides the ability to turn off compatibility
mode allowing systems to run entirely with the new routing
API. Old route API behaviour and support is not modified by this
sysctl.

Uses a single sysctl to cover both ipv4 and ipv6 following
other sysctls. Covers dumps and delete notifications as
suggested by David Ahern.
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f80116d

net: ipv6: new arg skip_notify to ip6_rt_del · 11dd74b3

由 Roopa Prabhu 提交于 4月 27, 2020

Used in subsequent work to skip route delete
notifications on nexthop deletes.
Suggested-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11dd74b3

27 4月, 2020 1 次提交

sysctl: pass kernel pointers to ->proc_handler · 32927393

由 Christoph Hellwig 提交于 4月 24, 2020

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

32927393

30 3月, 2020 1 次提交

net: add net available in build_state · faee6769

由 Alexander Aring 提交于 3月 27, 2020

The build_state callback of lwtunnel doesn't contain the net namespace
structure yet. This patch will add it so we can check on specific
address configuration at creation time of rpl source routes.
Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

faee6769

24 3月, 2020 1 次提交

Remove DST_HOST · af13b3c3

由 David Laight 提交于 3月 23, 2020

Previous changes to the IP routing code have removed all the
tests for the DS_HOST route flag.
Remove the flags and all the code that sets it.
Signed-off-by: NDavid Laight <david.laight@aculab.com>
Acked-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

af13b3c3

13 3月, 2020 1 次提交

inet: Use fallthrough; · a8eceea8

由 Joe Perches 提交于 3月 12, 2020

Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

And by hand:

net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
that causes gcc to emit a warning if converted in-place.

So move the new fallthrough; inside the containing #ifdef/#endif too.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a8eceea8

17 2月, 2020 1 次提交

ipv6: Fix nlmsg_flags when splitting a multipath route · afecdb37

由 Benjamin Poirier 提交于 2月 12, 2020

When splitting an RTA_MULTIPATH request into multiple routes and adding the
second and later components, we must not simply remove NLM_F_REPLACE but
instead replace it by NLM_F_CREATE. Otherwise, it may look like the netlink
message was malformed.

For example,
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0 \
		nexthop via fe80::30:2 dev dummy0
results in the following warnings:
[ 1035.057019] IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE
[ 1035.057517] IPv6: NLM_F_CREATE should be set when creating new route

This patch makes the nlmsg sequence look equivalent for __ip6_ins_rt() to
what it would get if the multipath route had been added in multiple netlink
operations:
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0
	ip route append 2001:db8::1/128 nexthop via fe80::30:2 dev dummy0

Fixes: 27596472 ("ipv6: fix ECMP route replacement")
Signed-off-by: NBenjamin Poirier <bpoirier@cumulusnetworks.com>
Reviewed-by: NMichal Kubecek <mkubecek@suse.cz>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afecdb37

15 1月, 2020 1 次提交

ipv6: Add "offload" and "trap" indications to routes · bb3c4ab9

由 Ido Schimmel 提交于 1月 14, 2020

In a similar fashion to previous patch, add "offload" and "trap"
indication to IPv6 routes.

This is done by using two unused bits in 'struct fib6_info' to hold
these indications. Capable drivers are expected to set these when
processing the various in-kernel route notifications.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bb3c4ab9

25 12月, 2019 3 次提交

ipv6: Remove old route notifications and convert listeners · caafb250

由 Ido Schimmel 提交于 12月 23, 2019

Now that mlxsw is converted to use the new FIB notifications it is
possible to delete the old ones and use the new replace / append /
delete notifications.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

caafb250

ipv6: Handle multipath route deletion notification · 0284696b

由 Ido Schimmel 提交于 12月 23, 2019

When an entire multipath route is deleted, only emit a notification if
it is the first route in the node. Emit a replace notification in case
the last sibling is followed by another route. Otherwise, emit a delete
notification.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0284696b

ipv6: Notify multipath route if should be offloaded · 0ee0f47c

由 Ido Schimmel 提交于 12月 23, 2019

In a similar fashion to previous patches, only notify the new multipath
route if it is the first route in the node or if it was appended to such
route.

The type of the notification (replace vs. append) is determined based on
the number of routes added ('nhn') and the number of sibling routes. If
the two do not match, then an append notification should be sent.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ee0f47c

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功