Commits · 40867d74c374b235e14d839f3a77f26684feefe5 · openeuler / Kernel

16 Mar, 2022 1 commit

net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74

Committed by David Ahern on Mar 14, 2022

The fundamental premise of VRF and l3mdev core code is binding a socket
to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
Legacy code resets flowi_oif to the l3mdev losing any original port
device binding. Ben (among others) has demonstrated use cases where the
original port device binding is important and needs to be retained.
This patch handles that by adding a new entry to the common flow struct
that can indicate the l3mdev index for later rule and table matching
avoiding the need to reset flowi_oif.

In addition to allowing more use cases that require port device binds,
this patch brings a few datapath simplifications:

1. l3mdev_fib_rule_match is only called when walking fib rules and
always after l3mdev_update_flow. That allows an optimization to bail
early for non-VRF type use cases when flowi_l3mdev is not set. Also,
only that index needs to be checked for the FIB table id.

2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
(e.g., VRF) device. By resetting flowi_oif only for this case the
FLOWI_FLAG_SKIP_NH_OIF flag is no longer needed and can be removed,
removing several checks in the datapath. The flowi_iif path can be
simplified to only be called if it is not loopback (loopback can
not be assigned to an L3 domain) and the l3mdev index is not already
set.

3. Avoid another device lookup in the output path when the fib lookup
returns a reject failure.

Note: 2 functional tests for local traffic with reject fib rules are
updated to reflect the new direct failure at FIB lookup time for ping
rather than the failure on packet path. The current code fails like this:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: Warning: source address might be selected on device other than: eth1
PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.

--- 172.16.3.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

where the test now directly fails:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: connect: No route to host
Signed-off-by: David Ahern <dsahern@kernel.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

40867d74

18 Feb, 2022 1 commit

ipv4: fix data races in fib_alias_hw_flags_set · 9fcf986c

Committed by Eric Dumazet on Feb 16, 2022

fib_alias_hw_flags_set() can be used by concurrent threads,
and is only RCU protected.

We need to annotate accesses to the following fields of struct fib_alias:

    offload, trap, offload_failed

Because of READ_ONCE()/WRITE_ONCE() limitations, make these
fields u8.

BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set

read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
 fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
 nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
 nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
 nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
 nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
 nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
 nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
 process_scheduled_works kernel/workqueue.c:2370 [inline]
 worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
 kthread+0x1bf/0x1e0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30

write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
 fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
 nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
 nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
 nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
 nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
 nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
 nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
 process_scheduled_works kernel/workqueue.c:2370 [inline]
 worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
 kthread+0x1bf/0x1e0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30

value changed: 0x00 -> 0x02

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e8-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work

Fixes: 90b93f1b ("ipv4: Add "offload" and "trap" indications to routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9fcf986c

11 Feb, 2022 1 commit

ipv4: add (struct uncached_list)->quarantine list · 29e5375d

Committed by Eric Dumazet on Feb 10, 2022

This is an optimization to keep the per-cpu lists as short as possible:

Whenever rt_flush_dev() changes one rtable dst.dev
matching the disappearing device, it can transfer the object
to a quarantine list, waiting for a final rt_del_uncached_list().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29e5375d

08 Feb, 2022 1 commit

ipv4: Use dscp_t in struct fib_alias · 32ccf110

Committed by Guillaume Nault on Feb 04, 2022

Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.

Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

32ccf110

31 Jan, 2022 1 commit

ipv4: Make ip_idents_reserve static · 47ed9442

Committed by David Ahern on Jan 28, 2022

ip_idents_reserve is only used in net/ipv4/route.c. Make it static
and remove the export.
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47ed9442

27 Jan, 2022 1 commit

ipv4: Namespaceify min_adv_mss sysctl knob · 2e9589ff

Committed by xu xin on Jan 26, 2022

Different network namespaces may need different values for the min_adv_mss
sysctl, the lower bound on the advertised MSS.

Enable min_adv_mss to be configured per network namespace.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e9589ff

04 Jan, 2022 2 commits

Namespaceify mtu_expires sysctl · 1135fad2

Committed by xu xin on Jan 04, 2022

This patch enables the sysctl mtu_expires to be configured per net
namespace.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

1135fad2

Namespaceify min_pmtu sysctl · 1de6b15a

Committed by xu xin on Jan 04, 2022

This patch enables the sysctl min_pmtu to be configured per net
namespace.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

1de6b15a

07 Dec, 2021 1 commit

net: dst: add net device refcount tracking to dst_entry · 9038c320

Committed by Eric Dumazet on Dec 04, 2021

We want to track all dev_hold()/dev_put() to ease leak hunting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9038c320

17 Nov, 2021 1 commit

net: align static siphash keys · 49ecc2e9

Committed by Eric Dumazet on Nov 15, 2021

siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

49ecc2e9
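
The alignment argument in 49ecc2e9 can be checked with a small C11 sketch: a 16-byte object aligned to 16 bytes cannot straddle a cache line whose size is a multiple of 16. The type `aligned_key_t` and the helper `fits_in_one_line()` are hypothetical userspace names, not the kernel's siphash_aligned_key_t definition.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit siphash key: two 64-bit words,
 * forced to 16-byte alignment. */
typedef struct {
	alignas(16) uint64_t key[2];
} aligned_key_t;

static_assert(sizeof(aligned_key_t) == 16, "key is 16 bytes");
static_assert(alignof(aligned_key_t) == 16, "key is 16-byte aligned");

/* Returns 1 if a 16-byte object starting at addr lies within a single
 * cache line of line_size bytes, 0 if it crosses a line boundary. */
static int fits_in_one_line(uintptr_t addr, uintptr_t line_size)
{
	assert(line_size > 0);
	return addr / line_size == (addr + 15) / line_size;
}
```

With 64-byte lines, an unaligned key at offset 56 would span two lines, while any offset that is a multiple of 16 keeps the key inside one line.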

20 Sep, 2021 1 commit

net/ipv4/route.c: remove superfluous header files from route.c · ffa66f15

Committed by Mianhan Liu on Sep 20, 2021

route.c doesn't use any macro or function declared in uaccess.h, types.h,
string.h, sockios.h, times.h, protocol.h, arp.h and l3mdev.h. Thus, these
files can be removed from route.c safely without affecting the compilation
of the net module.
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

ffa66f15

31 Aug, 2021 1 commit

ipv4: fix endianness issue in inet_rtm_getroute_build_skb() · 92548b0e

Committed by Eric Dumazet on Aug 30, 2021

The UDP length field should be in network order.
This removes the following sparse error:

net/ipv4/route.c:3173:27: warning: incorrect type in assignment (different base types)
net/ipv4/route.c:3173:27:    expected restricted __be16 [usertype] len
net/ipv4/route.c:3173:27:    got unsigned long

Fixes: 404eb77e ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

92548b0e

30 Aug, 2021 1 commit

ipv4: make exception cache less predictible · 67d6d681

Committed by Eric Dumazet on Aug 29, 2021

Even after commit 6457378f ("ipv4: use siphash instead of Jenkins in
fnhe_hashfun()"), an attacker can still use brute force to learn
some secrets from a victim linux host.

One way to defeat these attacks is to make the max depth of the hash
table bucket a random value.

Before this patch, each bucket of the hash table used to store exceptions
could contain 6 items under attack.

After the patch, each bucket contains a random number of items,
between 6 and 10. The attacker can no longer infer secrets.

This slightly increases the memory used by the hash table,
by 50% on average; we do not expect this to be a problem.

This patch is more complex than the prior one (IPv6 equivalent),
because IPv4 was reusing the oldest entry.
Since we need to be able to evict more than one entry per
update_or_create_fnhe() call, I had to replace
fnhe_oldest() with fnhe_remove_oldest().

Also note that we will queue extra kfree_rcu() calls under stress,
which hopefully won't be too big an issue.

Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Tested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

67d6d681
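
The idea in 67d6d681, a per-update random depth limit plus eviction of more than one old entry, can be sketched as follows. This is an illustration only: the constants come from the "between 6 and 10" wording of the message, `rand()` stands in for the kernel's own random source, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define BASE_DEPTH 6	/* fixed limit before the patch, per the message */
#define EXTRA_MAX  4	/* random extra, so the limit is 6..10 inclusive */

/* Pick a fresh depth limit. Because the limit changes, an observer can
 * no longer tell exactly which insertion triggers an eviction. */
static int random_depth_limit(void)
{
	return BASE_DEPTH + rand() % (EXTRA_MAX + 1);
}

/* Number of oldest entries to evict so that a chain of chain_len
 * entries fits the limit. May be more than one per update. */
static int entries_to_evict(int chain_len, int limit)
{
	assert(limit > 0);
	return chain_len > limit ? chain_len - limit : 0;
}
```

A chain that grew to 10 entries under one limit may need several evictions when the next limit drawn is 6, which is why a helper that removes only a single oldest entry is no longer enough.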

26 Aug, 2021 1 commit

ipv4: use siphash instead of Jenkins in fnhe_hashfun() · 6457378f

Committed by Eric Dumazet on Aug 25, 2021

A group of security researchers brought to our attention
the weakness of hash function used in fnhe_hashfun().

Let's use siphash instead of Jenkins Hash, to considerably
reduce security risks.

Also remove the inline keyword, this really is distracting.

Fixes: d546c621 ("ipv4: harden fnhe_hashfun()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

6457378f

05 Aug, 2021 1 commit

net: Remove redundant if statements · 1160dfa1

Committed by Yajun Deng on Aug 05, 2021

The 'if (dev)' check has already moved into dev_{put,hold}, so remove the
redundant if statements.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

1160dfa1

03 Aug, 2021 1 commit

net: Keep vertical alignment · 0547ffe6

Committed by Yajun Deng on Aug 02, 2021

The files under /proc/net/stat/ are not vertically aligned, which makes them
difficult to read. Modify the seq_printf statements to keep vertical alignment.

v2:
 - Use seq_puts() and seq_printf() correctly.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

0547ffe6

21 Jul, 2021 1 commit

net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward · ac6627a2

Committed by Vadim Fedorenko on Jul 20, 2021

Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
aligned in both address families
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac6627a2

29 Jun, 2021 1 commit

net: lwtunnel: handle MTU calculation in forwading · fade5641

Committed by Vadim Fedorenko on Jun 25, 2021

Commit 14972cbd ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carrying the encap headroom and
using it in the output MTU calculation. But the forwarding part was not
covered, which created a difference in MTU between output and forwarding
and led to silent drops on the ipv4 forwarding path. Fix it by taking
the lwtunnel encap headroom into account.

The same commit also introduced a difference in how RTAX_MTU is treated
in IPv4 and IPv6, where the latter explicitly removes the lwtunnel encap
headroom from the route MTU. Make the IPv4 version do the same.

Fixes: 14972cbd ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

fade5641

15 Jun, 2021 1 commit

ipv4: Fix device used for dst_alloc with local routes · b87b04f5

Committed by David Ahern on Jun 12, 2021

Oliver reported a use case where deleting a VRF device can hang
waiting for the refcnt to drop to 0. The root cause is that the dst
is allocated against the VRF device but cached on the loopback
device.

The use case (added to the selftests) has an implicit VRF crossing
due to the ordering of the FIB rules (lookup local is before the
l3mdev rule, but the problem occurs even if the FIB rules are
re-ordered with local after l3mdev because the VRF table does not
have a default route to terminate the lookup). The end result is
that the FIB lookup returns the loopback device as the nexthop,
but the ingress device is in a VRF. The mismatch causes the dst
to be allocated against the VRF device but then cached on the loopback.

The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu):
pick the dst alloc device based on the fib lookup result but with checks
that the result has a nexthop device (e.g., not an unreachable or
prohibit entry).

Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b87b04f5

19 May, 2021 2 commits

ipv4: Add custom multipath hash policy · 4253b498

Committed by Ido Schimmel on May 17, 2021

Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.

The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.

In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4253b498
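
The "skip dissection when no field of that group is requested" rule in 4253b498 reduces to two bitmask tests. The sketch below assumes made-up bit assignments and names (`HASH_F_*`, `need_outer_dissect()`, `need_inner_dissect()`); the kernel defines its own field masks for the fib_multipath_hash_fields sysctl.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout: low nibble = outer fields, high = inner. */
#define HASH_F_SRC_IP        (1u << 0)
#define HASH_F_DST_IP        (1u << 1)
#define HASH_F_INNER_SRC_IP  (1u << 4)
#define HASH_F_INNER_DST_IP  (1u << 5)

#define HASH_F_OUTER_MASK    0x0fu
#define HASH_F_INNER_MASK    0xf0u

static_assert((HASH_F_OUTER_MASK & HASH_F_INNER_MASK) == 0,
	      "outer and inner field groups must not overlap");

/* Dissect the outer flow only if some outer field feeds the hash. */
static int need_outer_dissect(uint32_t fields)
{
	return (fields & HASH_F_OUTER_MASK) != 0;
}

/* Inner fields exist only in a packet. With no skb the flow key is
 * used and, per the message above, only outer fields are considered. */
static int need_inner_dissect(uint32_t fields, int have_skb)
{
	return have_skb && (fields & HASH_F_INNER_MASK) != 0;
}
```

A configuration that selects only inner fields therefore costs one dissector pass per packet instead of two.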

ipv4: Calculate multipath hash inside switch statement · 2e68ea92

Committed by Ido Schimmel on May 17, 2021

A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.

Prepare for this change by moving the multipath hash calculation inside
the switch statement.

No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e68ea92

25 Mar, 2021 1 commit

inet: use bigger hash table for IP ID generation · aa6dd211

Committed by Eric Dumazet on Mar 24, 2021

In commit 73f156a6 ("inetpeer: get rid of ip_id_count")
I used a very small hash table that could be abused
by patient attackers to reveal sensitive information.

Switch to a dynamic sizing, depending on RAM size.

Typical big hosts will now use 128x more storage (2 MB)
to get a similar increase in security and reduction
of hash collisions.

As a bonus, use of alloc_large_system_hash() spreads
allocated memory among all NUMA nodes.

Fixes: 73f156a6 ("inetpeer: get rid of ip_id_count")
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa6dd211

17 Mar, 2021 1 commit

net: ipv4: route.c: simplify procfs code · f105f26e

Committed by Yejune Deng on Mar 16, 2021

Use proc_create_seq(), which directly takes a struct seq_operations,
and deal with network namespaces in ->open.
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f105f26e

13 Mar, 2021 1 commit

net: ipv4: route.c: Fix indentation of multi line comment. · 6ad08600

Committed by Shubhankar Kuranagatti on Mar 12, 2021

All comment lines inside the comment block have been aligned.
Every line of comment starts with a * (uniformity in code).
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6ad08600

11 Mar, 2021 2 commits

net: ipv4: route.c: fix space before tab · 6b9c8f46

Committed by Shubhankar Kuranagatti on Mar 11, 2021

The extra space before tab space has been removed.
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b9c8f46

net: Consolidate common blackhole dst ops · c4c877b2

Committed by Daniel Borkmann on Mar 10, 2021

Move generic blackhole dst ops to the core and use them from both
ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
functional change otherwise. We need these also in other locations
and having to define them over and over again is not great.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4c877b2

09 Feb, 2021 1 commit

IPv4: Add "offload failed" indication to routes · 36c5100e

Committed by Amit Cohen on Feb 07, 2021

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

To avoid such cases, previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed, this behavior is controlled by sysctl.

With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.

This patch adds an "offload_failed" indication to IPv4 routes, so that
users will have better visibility into the offload process.

'struct fib_alias', and 'struct fib_rt_info' are extended with new field
that indicates if route offload failed. Note that the new field is added
using unused bit and therefore there is no need to increase structs size.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

36c5100e

05 Feb, 2021 1 commit

net: fix building errors on powerpc when CONFIG_RETPOLINE is not set · 9c97921a

Committed by Brian Vazquez on Feb 04, 2021

This commit fixes the errors reported when building for powerpc:

ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL

Fixes: f67fbeae ("net: use indirect call helpers for dst_mtu")
Fixes: bbd807df ("net: indirect call helpers for ipv4/ipv6 dst_check functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/r/20210204181839.558951-2-brianvv@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9c97921a

04 Feb, 2021 2 commits

net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df

Committed by Brian Vazquez on Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bbd807df

net: use indirect call helpers for dst_mtu · f67fbeae

Committed by Brian Vazquez on Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f67fbeae
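
The technique behind bbd807df and f67fbeae is to compare a function pointer against its most likely targets and call those directly, falling back to the indirect call only for anything else. Below is a minimal userspace sketch of that pattern. `v4_mtu()`, `v6_mtu()`, `call_mtu()` and the returned values are hypothetical placeholders, not the kernel's ipv4_mtu()/ip6_mtu() or its helper macros.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int (*mtu_fn)(const void *dst);

/* Placeholder implementations; the return values are arbitrary. */
static unsigned int v4_mtu(const void *dst) { (void)dst; return 1500; }
static unsigned int v6_mtu(const void *dst) { (void)dst; return 1280; }

/* Direct calls for the common targets, indirect call otherwise. A
 * direct call can be predicted, and inlined, far more cheaply than an
 * indirect branch that goes through a retpoline. */
static unsigned int call_mtu(mtu_fn f, const void *dst)
{
	assert(f != NULL);

	if (f == v6_mtu)
		return v6_mtu(dst);
	if (f == v4_mtu)
		return v4_mtu(dst);
	return f(dst);
}
```

Because the helper names the targets from another translation unit, those functions can no longer be `static`, which is the kind of linkage issue the follow-up powerpc build fix (9c97921a) deals with.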

29 Nov, 2020 1 commit

ipv4: Fix tos mask in inet_rtm_getroute() · 1ebf1790

Committed by Guillaume Nault on Nov 26, 2020

When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.

As a result, "ip route get" can return a different route than what
would be used when sending real packets.

For example:

    $ ip route add 192.0.2.11/32 dev eth0
    $ ip route add unreachable 192.0.2.11/32 tos 2
    $ ip route get 192.0.2.11 tos 2
    RTNETLINK answers: No route to host

But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:

    $ ping -c 1 -Q 2 192.0.2.11
    PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms

    --- 192.0.2.11 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms

This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.

Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1ebf1790
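
The effect of the mask restored in 1ebf1790 can be worked through numerically. The sketch assumes the usual definition of the route TOS mask as the TOS mask 0x1E with the two low-order bits cleared; `TOS_MASK`, `RT_MASK` and `rt_tos()` are local names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOS_MASK 0x1E			/* 4-bit TOS field (assumed) */
#define RT_MASK  (TOS_MASK & ~3)	/* drops the two low-order bits */

static_assert(RT_MASK == 0x1C, "route lookups only see bits 2..4");

/* TOS value actually used as a route lookup key. */
static uint8_t rt_tos(uint8_t tos)
{
	return (uint8_t)(tos & RT_MASK);
}
```

With the mask applied, the `tos 2` request in the example above is looked up as TOS 0, so "ip route get" reports the same route that the ping with `-Q 2` actually uses.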

15 Nov, 2020 1 commit

IPv4: RTM_GETROUTE: Add RTA_ENCAP to result · ae8cb932

Committed by Oliver Herms on Nov 13, 2020

This patch adds an IPv4 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(e.g. ip route get 192.0.2.1).
Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113085517.GA1307262@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ae8cb932

12 Nov, 2020 1 commit

net: evaluate net.ipvX.conf.all.disable_policy and disable_xfrm · 62679a8d

Committed by Vincent Bernat on Nov 07, 2020

disable_policy and disable_xfrm are per-interface sysctls that
disable IPsec policy or encryption on an interface. However, while an
"all" variant is exposed, it was a no-op since it was never evaluated.
We use the usual "or" logic for this kind of sysctl.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

62679a8d
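
The "or" logic mentioned in 62679a8d means a setting takes effect if either the "all" value or the per-interface value is set. A minimal sketch under that reading, with a hypothetical `struct ipsec_conf` and helper names rather than the kernel's devconf accessors:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the two sysctls for one configuration scope. */
struct ipsec_conf {
	int disable_policy;
	int disable_xfrm;
};

/* Policy checks are skipped if "all" or the interface disables them. */
static int policy_disabled(const struct ipsec_conf *all,
			   const struct ipsec_conf *dev)
{
	assert(all != NULL && dev != NULL);
	return all->disable_policy || dev->disable_policy;
}

/* Same rule for the transform (encryption) side. */
static int xfrm_disabled(const struct ipsec_conf *all,
			 const struct ipsec_conf *dev)
{
	assert(all != NULL && dev != NULL);
	return all->disable_xfrm || dev->disable_xfrm;
}
```

Before the change, the equivalent of these helpers looked only at the per-interface value, which is why writing to the "all" entry had no effect.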

11 Oct, 2020 1 commit

ipv4: Restore flowi4_oif update before call to xfrm_lookup_route · 874fb9e2

Committed by David Ahern on Oct 09, 2020

Tobias reported regressions in IPsec tests following the patch
referenced by the Fixes tag below. The root cause is dropping the
reset of the flowi4_oif after the fib_lookup. Apparently it is
needed for xfrm cases, so restore the oif update to ip_route_output_flow
right before the call to xfrm_lookup_route.

Fixes: 2fbc6e89 ("ipv4: Update exception handling for multipath routes via same device")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

874fb9e2

16 Sep, 2020 1 commit

ipv4: Update exception handling for multipath routes via same device · 2fbc6e89

Committed by David Ahern on Sep 14, 2020

Kfir reported that pmtu exceptions are not created properly for
deployments where multipath routes use the same device.

After some digging I see 2 compounding problems:
1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
   the route lookup. This is the second use case where this has
   been a problem (the first is related to use of vti devices with
   VRF). I can not find any reason for the oif to be changed after the
   lookup; the code goes back to the start of git. It does not seem
   logical so remove it.

2. fib_lookups for exceptions do not call fib_select_path to handle
   multipath route selection based on the hash.

The end result is that the fib_lookup used to add the exception
always creates it using the first leg of the route.

An example topology showing the problem:

                 |  host1
             +------+
             | eth0 |  .209
             +------+
                 |
             +------+
     switch  | br0  |
             +------+
                 |
       +---------+---------+
       | host2             |  host3
   +------+             +------+
   | eth0 | .250        | eth0 | 192.168.252.252
   +------+             +------+

   +-----+             +-----+
   | vti | .2          | vti | 192.168.247.3
   +-----+             +-----+
       \                  /
 =================================
 tunnels
         192.168.247.1/24

for h in host1 host2 host3; do
        ip netns add ${h}
        ip -netns ${h} link set lo up
        ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
done

ip netns add switch
ip -netns switch li set lo up
ip -netns switch link add br0 type bridge stp 0
ip -netns switch link set br0 up

for n in 1 2 3; do
        ip -netns switch link add eth-sw type veth peer name eth-h${n}
        ip -netns switch li set eth-h${n} master br0 up
        ip -netns switch li set eth-sw netns host${n} name eth0
done

ip -netns host1 addr add 192.168.252.209/24 dev eth0
ip -netns host1 link set dev eth0 up
ip -netns host1 route add 192.168.247.0/24 \
        nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0

ip -netns host2 addr add 192.168.252.250/24 dev eth0
ip -netns host2 link set dev eth0 up

ip -netns host3 addr add 192.168.252.252/24 dev eth0
ip -netns host3 link set dev eth0 up

ip netns add tunnel
ip -netns tunnel li set lo up
ip -netns tunnel li add br0 type bridge
ip -netns tunnel li set br0 up
for n in $(seq 11 20); do
        ip -netns tunnel addr add dev br0 192.168.247.${n}/24
done

for n in 2 3
do
        ip -netns tunnel link add vti${n} type veth peer name eth${n}
        ip -netns tunnel link set eth${n} mtu 1360 master br0 up
        ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
        ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
done
ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3

ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
ip -netns host1 ro ls cache

Before this patch the cache always shows exceptions against the first
leg in the multipath route; 192.168.252.250 per this example. Since the
hash has an initial random seed, you may need to vary the final octet
more than what is listed. In my tests, using addresses between 11 and 19
usually found 1 that used both legs.

With this patch, the cache will have exceptions for both legs.

Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions")
Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2fbc6e89

15 Sep, 2020 1 commit

ipv4: Initialize flowi4_multipath_hash in data path · 1869e226

Committed by David Ahern on Sep 13, 2020

flowi4_multipath_hash was added by the commit referenced below for
tunnels. Unfortunately, the patch did not initialize the new field
for several fast path lookups that do not initialize the entire flow
struct to 0. Fix those locations. Currently, flowi4_multipath_hash
is random garbage and affects the hash value computed by
fib_multipath_hash for multipath selection.

Fixes: 24ba1440 ("route: Add multipath_hash in flowi_common to make user-define hash")
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

1869e226

01 Sep, 2020 1 commit

net: clean up codestyle · 5af68891

Committed by Miaohe Lin on Aug 29, 2020

This is a pure codestyle cleanup patch. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5af68891

25 Aug, 2020 2 commits

net: clean up codestyle for net/ipv4 · 343d8c60

Committed by Miaohe Lin on Aug 25, 2020

This is a pure codestyle cleanup patch. Also add a blank line after
declarations as warned by checkpatch.pl.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

343d8c60

net: gain ipv4 mtu when mtu is not locked · 8b4510d7

Committed by Miaohe Lin on Aug 24, 2020

When mtu is locked, we should not obtain ipv4 mtu as we return immediately
in this case and leave acquired ipv4 mtu unused.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b4510d7

05 Aug, 2020 1 commit

ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18

Committed by Stefano Brivio on Aug 04, 2020

Currently, processes sending traffic to a local bridge with an
encapsulation device as a port don't get ICMP errors if they exceed
the PMTU of the encapsulated link.

David Ahern suggested this as a hack, but it actually looks like
the correct solution: when we update the PMTU for a given destination
by means of updating or creating a route exception, the encapsulation
might trigger this because of PMTU discovery happening either on the
encapsulation device itself, or its lower layer. This happens on
bridged encapsulations only.

The output interface shouldn't matter, because we already have a
valid destination. Drop the output interface restriction from the
associated route lookup.

For UDP tunnels, we will now have a route exception created for the
encapsulation itself, with a MTU value reflecting its headroom, which
allows a bridge forwarding IP packets originated locally to deliver
errors back to the sending socket.

The behaviour is now consistent with IPv6 and verified with selftests
pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
this series.

v2:
- reset output interface only for bridge ports (David Ahern)
- add and use netif_is_any_bridge_port() helper (David Ahern)
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df23bb18

openeuler / Kernel synced successfully almost 2 years ago
