提交 · 996237d9ba4d092469fbfca18995656c32834ac7 · openeuler / Kernel

10 8月, 2022 1 次提交

ipv6: do not use RT_TOS for IPv6 flowlabel · ab7e2e0d

由 Matthias May 提交于 8月 05, 2022

According to Guillaume Nault RT_TOS should never be used for IPv6.

Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.

But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.

Fixes: 571912c6 ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Acked-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NMatthias May <matthias.may@westermo.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ab7e2e0d

20 7月, 2022 1 次提交

ipv6/udp: support externally provided ubufs · 1fd3ae8c

由 Pavel Begunkov 提交于 7月 12, 2022

Teach ipv6/udp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

1fd3ae8c

19 7月, 2022 1 次提交

ipv6: avoid partial copy for zc · 773ba4fe

由 Pavel Begunkov 提交于 7月 12, 2022

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such copies. And as a bonus we can allocate smaller skb.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

773ba4fe

09 6月, 2022 1 次提交

ipv6: Fix signed integer overflow in __ip6_append_data · f93431c8

由 Wang Yufen 提交于 6月 07, 2022

Resurrect ubsan overflow checks and ubsan report this warning,
fix it by change the variable [length] type to size_t.

UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
2147479552 + 8567 cannot be represented in type 'int'
CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
Hardware name: linux,dummy-virt (DT)
Call trace:
  dump_backtrace+0x214/0x230
  show_stack+0x30/0x78
  dump_stack_lvl+0xf8/0x118
  dump_stack+0x18/0x30
  ubsan_epilogue+0x18/0x60
  handle_overflow+0xd0/0xf0
  __ubsan_handle_add_overflow+0x34/0x44
  __ip6_append_data.isra.48+0x1598/0x1688
  ip6_append_data+0x128/0x260
  udpv6_sendmsg+0x680/0xdd0
  inet6_sendmsg+0x54/0x90
  sock_sendmsg+0x70/0x88
  ____sys_sendmsg+0xe8/0x368
  ___sys_sendmsg+0x98/0xe0
  __sys_sendmmsg+0xf4/0x3b8
  __arm64_sys_sendmmsg+0x34/0x48
  invoke_syscall+0x64/0x160
  el0_svc_common.constprop.4+0x124/0x300
  do_el0_svc+0x44/0xc8
  el0_svc+0x3c/0x1e8
  el0t_64_sync_handler+0x88/0xb0
  el0t_64_sync+0x16c/0x170

Changes since v1:
-Change the variable [length] type to unsigned, as Eric Dumazet suggested.
Changes since v2:
-Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
Changes since v3:
-Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
Jakub Kicinski suggested.
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NWang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

f93431c8

16 5月, 2022 1 次提交

ipv6: Add hop-by-hop header to jumbograms in ip6_output · 80e425b6

由 Coco Li 提交于 5月 13, 2022

Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.

Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: NCoco Li <lixiaoyan@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

80e425b6

30 4月, 2022 2 次提交

ipv6: refactor ip6_finish_output2() · 58f71be5

由 Pavel Begunkov 提交于 4月 28, 2022

Throw neigh checks in ip6_finish_output2() under a single slow path if,
so we don't have the overhead in the hot path.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58f71be5

ipv6: help __ip6_finish_output() inlining · 4b143ed7

由 Pavel Begunkov 提交于 4月 28, 2022

There are two callers of __ip6_finish_output(), both are in
ip6_finish_output(). We can combine the call sites into one and handle
return code after, that will inline __ip6_finish_output().

Note, error handling under NET_XMIT_CN will only return 0 if
__ip6_finish_output() succeded, and in this case it return 0.
Considering that NET_XMIT_SUCCESS is 0, it'll be returning exactly the
same result for it as before.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b143ed7

13 4月, 2022 1 次提交

net: ip: add skb drop reasons to ip forwarding · 2edc1a38

由 Menglong Dong 提交于 4月 13, 2022

Replace kfree_skb() which is used in ip6_forward() and ip_forward()
with kfree_skb_reason().

The new drop reason 'SKB_DROP_REASON_PKT_TOO_BIG' is introduced for
the case that the length of the packet exceeds MTU and can't
fragment.
Signed-off-by: NMenglong Dong <imagedong@tencent.com>
Reviewed-by: NJiang Biao <benbjiang@tencent.com>
Reviewed-by: NHao Peng <flyingpeng@tencent.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2edc1a38

11 4月, 2022 1 次提交

ipv6: fix panic when forwarding a pkt with no in6 dev · e3fa461d

由 Nicolas Dichtel 提交于 4月 08, 2022

kongweibin reported a kernel panic in ip6_forward() when input interface
has no in6 dev associated.

The following tc commands were used to reproduce this panic:
tc qdisc del dev vxlan100 root
tc qdisc add dev vxlan100 root netem corrupt 5%

CC: stable@vger.kernel.org
Fixes: ccd27f05 ("ipv6: fix 'disable_policy' for fwd packets")
Reported-by: Nkongweibin <kongweibin2@huawei.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e3fa461d

16 3月, 2022 1 次提交

net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74

由 David Ahern 提交于 3月 14, 2022

The fundamental premise of VRF and l3mdev core code is binding a socket
to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
Legacy code resets flowi_oif to the l3mdev losing any original port
device binding. Ben (among others) has demonstrated use cases where the
original port device binding is important and needs to be retained.
This patch handles that by adding a new entry to the common flow struct
that can indicate the l3mdev index for later rule and table matching
avoiding the need to reset flowi_oif.

In addition to allowing more use cases that require port device binds,
this patch brings a few datapath simplications:

1. l3mdev_fib_rule_match is only called when walking fib rules and
always after l3mdev_update_flow. That allows an optimization to bail
early for non-VRF type uses cases when flowi_l3mdev is not set. Also,
only that index needs to be checked for the FIB table id.

2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
(e.g., VRF) device. By resetting flowi_oif only for this case the
FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed,
removing several checks in the datapath. The flowi_iif path can be
simplified to only be called if the it is not loopback (loopback can
not be assigned to an L3 domain) and the l3mdev index is not already
set.

3. Avoid another device lookup in the output path when the fib lookup
returns a reject failure.

Note: 2 functional tests for local traffic with reject fib rules are
updated to reflect the new direct failure at FIB lookup time for ping
rather than the failure on packet path. The current code fails like this:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: Warning: source address might be selected on device other than: eth1
PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.

--- 172.16.3.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

where the test now directly fails:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: connect: No route to host
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Tested-by: NBen Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

40867d74

12 3月, 2022 1 次提交

net: ipv6: fix skb_over_panic in __ip6_append_data · 5e34af41

由 Tadeusz Struk 提交于 3月 10, 2022

Syzbot found a kernel bug in the ipv6 stack:
LINK: https://syzkaller.appspot.com/bug?id=205d6f11d72329ab8d62a610c44c5e7e25415580
The reproducer triggers it by sending a crafted message via sendmmsg()
call, which triggers skb_over_panic, and crashes the kernel:

skbuff: skb_over_panic: text:ffffffff84647fb4 len:65575 put:65575
head:ffff888109ff0000 data:ffff888109ff0088 tail:0x100af end:0xfec0
dev:<NULL>

Update the check that prevents an invalid packet with MTU equal
to the fregment header size to eat up all the space for payload.

The reproducer can be found here:
LINK: https://syzkaller.appspot.com/text?tag=ReproC&x=1648c83fb00000

Reported-by: syzbot+e223cf47ec8ae183f2a0@syzkaller.appspotmail.com
Signed-off-by: NTadeusz Struk <tadeusz.struk@linaro.org>
Acked-by: NWillem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20220310232538.1044947-1-tadeusz.struk@linaro.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

5e34af41

03 3月, 2022 2 次提交

net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101

由 Martin KaFai Lau 提交于 3月 02, 2022

Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

If skb->tstamp has the mono delivery_time, clearing it can hurt
the performance when it finally transmits out to fq@phy-dev.

The earlier patch added a skb->mono_delivery_time bit to
flag the skb->tstamp carrying the mono delivery_time.

This patch adds skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.

The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally.  It will be done in a latter patch.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de799101

net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a

由 Martin KaFai Lau 提交于 3月 02, 2022

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.

Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev.  For example, when forwarding from egress to
ingress and then finally back to egress:

            tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                     ^              ^
                                     reset          rest

This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq.  A latter patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work.  The following patches will
  get the other parts of the stack ready first.  Then another patch
  after that will finally set the skb->mono_delivery_time. ]

skb_set_delivery_time() function is added.  It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.

A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place
holder.  It will be enabled in a latter patch.
A similar case in tcp_ipv6 can be done with
skb_set_delivery_time() in tcp_v6_send_response().

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdfSigned-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1ac9c8a

26 2月, 2022 1 次提交

net: ip: add skb drop reasons for ip egress path · 5e187189

由 Menglong Dong 提交于 2月 26, 2022

Replace kfree_skb() which is used in the packet egress path of IP layer
with kfree_skb_reason(). Functions that are involved include:

__ip_queue_xmit()
ip_finish_output()
ip_mc_finish_output()
ip6_output()
ip6_finish_output()
ip6_finish_output2()

Following new drop reasons are introduced:

SKB_DROP_REASON_IP_OUTNOROUTES
SKB_DROP_REASON_BPF_CGROUP_EGRESS
SKB_DROP_REASON_IPV6DISABLED
SKB_DROP_REASON_NEIGH_CREATEFAIL
Reviewed-by: NMengen Sun <mengensun@tencent.com>
Reviewed-by: NHao Peng <flyingpeng@tencent.com>
Signed-off-by: NMenglong Dong <imagedong@tencent.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5e187189

18 2月, 2022 1 次提交

net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57

由 Eric Dumazet 提交于 2月 17, 2022

UDP sendmsg() can be lockless, this is causing all kinds
of data races.

This patch converts sk->sk_tskey to remove one of these races.

BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
 __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
 __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000054d -> 0x0000054e

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1cdec57

17 2月, 2022 1 次提交

net: ping6: support setting basic SOL_IPV6 options via cmsg · 13651224

由 Jakub Kicinski 提交于 2月 16, 2022

Support setting IPV6_HOPLIMIT, IPV6_TCLASS, IPV6_DONTFRAG
during sendmsg via SOL_IPV6 cmsgs.

tclass and dontfrag are init'ed from struct ipv6_pinfo in
ipcm6_init_sk(), while hlimit is inited to -1, so we need
to handle it being populated via cmsg explicitly.

Leave extension headers and flowlabel unimplemented.
Those are slightly more laborious to test and users
seem to primarily care about IPV6_TCLASS.
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13651224

28 1月, 2022 7 次提交

ipv6: optimise dst refcounting on cork init · 40ac240c

由 Pavel Begunkov 提交于 1月 27, 2022

udpv6_sendmsg() doesn't need dst after calling ip6_make_skb(), so
instead of taking an additional reference inside ip6_setup_cork()
and releasing the initial one afterwards, we can hand over a reference
into ip6_make_skb() saving two atomics. The only other user of
ip6_setup_cork() is ip6_append_data() and it requires an extra
dst_hold().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

40ac240c

udp6: pass flow in ip6_make_skb together with cork · f37a4cc6

由 Pavel Begunkov 提交于 1月 27, 2022

Another preparation patch. inet_cork_full already contains a field for
iflow, so we can avoid passing a separate struct iflow6 into
__ip6_append_data() and ip6_make_skb(), and use the flow stored in
inet_cork_full. Make sure callers set cork->fl, i.e. we init it in
ip6_append_data() and before calling ip6_make_skb().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

f37a4cc6

ipv6: pass full cork into __ip6_append_data() · f3b46a3e

由 Pavel Begunkov 提交于 1月 27, 2022

Convert a struct inet_cork argument in __ip6_append_data() to struct
inet_cork_full. As one struct contains another inet_cork is still can
be accessed via ->base field. It's a preparation patch making further
changes a bit cleaner.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

f3b46a3e

ipv6: don't zero inet_cork_full::fl after use · 940ea00b

由 Pavel Begunkov 提交于 1月 27, 2022

It doesn't appear there is any reason for ip6_cork_release() to zero
cork->fl, it'll be fully filled on next initialisation. This 88 bytes
memset accounts to 0.3-0.5% of total CPU cycles.
It's also needed in following patches and allows to remove an extar flow
copy in udp_v6_push_pending_frames().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

940ea00b

ipv6: clean up cork setup/release · d656b2ea

由 Pavel Begunkov 提交于 1月 27, 2022

Clean up ip6_setup_cork() and ip6_cork_release() adding a local variable
for v6_cork->opt. It's a preparation patch for further changes.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

d656b2ea

ipv6: remove daddr temp buffer in __ip6_make_skb · b60d4e58

由 Pavel Begunkov 提交于 1月 27, 2022

ipv6_push_nfrag_opts() doesn't change passed daddr, and so
__ip6_make_skb() doesn't actually need to keep an on-stack copy of
fl6->daddr. Set initially final_dst to fl6->daddr,
ipv6_push_nfrag_opts() will override it if needed, and get rid of extra
copies.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b60d4e58

ipv6: optimise dst refcounting on skb init · cd3c7480

由 Pavel Begunkov 提交于 1月 27, 2022

__ip6_make_skb() gets a cork->dst ref, hands it over to skb and shortly
after puts cork->dst. Save two atomics by stealing it without extra
referencing, ip6_cork_release() handles NULL cork->dst.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

cd3c7480

24 1月, 2022 1 次提交

xfrm: fix MTU regression · 6596a022

由 Jiri Bohac 提交于 1月 19, 2022

Commit 749439bf ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.

A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.

E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".

Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.

We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

Commit b515d263 ("xfrm:
xfrm_state_mtu should return at least 1280 for ipv6") attempted
to fix this but caused another regression in TCP MSS calculations
and had to be reverted.

The patch below fixes the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message.
Signed-off-by: NJiri Bohac <jbohac@suse.cz>
Fixes: 749439bf ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

6596a022

22 11月, 2021 1 次提交

ipv6: fix typos in __ip6_finish_output() · 19d36c5f

由 Eric Dumazet 提交于 11月 18, 2021

We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and
IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED

Found by code inspection, please double check that fixing this bug
does not surface other bugs.

Fixes: 09ee9dba ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Tobias Brunner <tobias@strongswan.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Tested-by: NTobias Brunner <tobias@strongswan.org>
Acked-by: NTobias Brunner <tobias@strongswan.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19d36c5f

16 11月, 2021 1 次提交

net: remove sk_route_nocaps · aba54656

由 Eric Dumazet 提交于 11月 15, 2021

Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aba54656

16 10月, 2021 1 次提交

ipv6: When forwarding count rx stats on the orig netdev · 0857d6f8

由 Stephen Suryaputra 提交于 10月 14, 2021

Commit bdb7cc64 ("ipv6: Count interface receive statistics on the
ingress netdev") does not work when ip6_forward() executes on the skbs
with vrf-enslaved netdev. Use IP6CB(skb)->iif to get to the right one.

Add a selftest script to verify.

Fixes: bdb7cc64 ("ipv6: Count interface receive statistics on the ingress netdev")
Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211014130845.410602-1-ssuryaextr@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

0857d6f8

03 8月, 2021 2 次提交

ipv6: use skb_expand_head in ip6_xmit · 0c9f227b

由 Vasily Averin 提交于 8月 02, 2021

Unlike skb_realloc_headroom, new helper skb_expand_head
does not allocate a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c9f227b

ipv6: use skb_expand_head in ip6_finish_output2 · e415ed3a

由 Vasily Averin 提交于 8月 02, 2021

Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate
a new skb if possible.

Additionally this patch replaces commonly used dereferencing with variables.
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e415ed3a

23 7月, 2021 1 次提交

ipv6: decrease hop limit counter in ip6_forward() · 46c7655f

由 Kangmin Park 提交于 7月 23, 2021

Decrease hop limit counter when deliver skb to ndp proxy.
Signed-off-by: NKangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46c7655f

21 7月, 2021 1 次提交

net: ipv6: introduce ip6_dst_mtu_maybe_forward · 427faee1

由 Vadim Fedorenko 提交于 7月 20, 2021

Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
reuse this code in ip6_mtu. Actually these two functions were
almost duplicates, this change will simplify the maintaince of
mtu calculation code.
Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

427faee1

20 7月, 2021 1 次提交

ipv6: ip6_finish_output2: set sk into newly allocated nskb · 2d85a1b3

由 Vasily Averin 提交于 7月 19, 2021

skb_set_owner_w() should set sk not to old skb but to new nskb.

Fixes: 5796015f ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/70c0744f-89ae-1869-7e3e-4fa292158f4b@virtuozzo.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

2d85a1b3

13 7月, 2021 1 次提交

ipv6: allocate enough headroom in ip6_finish_output2() · 5796015f

由 Vasily Averin 提交于 7月 12, 2021

When TEE target mirrors traffic to another interface, sk_buff may
not have enough headroom to be processed correctly.
ip_finish_output2() detect this situation for ipv4 and allocates
new skb with enogh headroom. However ipv6 lacks this logic in
ip_finish_output2 and it leads to skb_under_panic:

 skbuff: skb_under_panic: text:ffffffffc0866ad4 len:96 put:24
 head:ffff97be85e31800 data:ffff97be85e317f8 tail:0x58 end:0xc0 dev:gre0
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:110!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 393 Comm: kworker/2:2 Tainted: G           OE     5.13.0 #13
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x48/0x4a
 Call Trace:
  skb_push.cold.111+0x10/0x10
  ipgre_header+0x24/0xf0 [ip_gre]
  neigh_connected_output+0xae/0xf0
  ip6_finish_output2+0x1a8/0x5a0
  ip6_output+0x5c/0x110
  nf_dup_ipv6+0x158/0x1000 [nf_dup_ipv6]
  tee_tg6+0x2e/0x40 [xt_TEE]
  ip6t_do_table+0x294/0x470 [ip6_tables]
  nf_hook_slow+0x44/0xc0
  nf_hook.constprop.34+0x72/0xe0
  ndisc_send_skb+0x20d/0x2e0
  ndisc_send_ns+0xd1/0x210
  addrconf_dad_work+0x3c8/0x540
  process_one_work+0x1d1/0x370
  worker_thread+0x30/0x390
  kthread+0x116/0x130
  ret_from_fork+0x22/0x30
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5796015f

07 7月, 2021 1 次提交

ipv6: fix 'disable_policy' for fwd packets · ccd27f05

由 Nicolas Dichtel 提交于 7月 06, 2021

The goal of commit df789fe7 ("ipv6: Provide ipv6 version of
"disable_policy" sysctl") was to have the disable_policy from ipv4
available on ipv6.
However, it's not exactly the same mechanism. On IPv4, all packets coming
from an interface, which has disable_policy set, bypass the policy check.
For ipv6, this is done only for local packets, ie for packets destinated to
an address configured on the incoming interface.

Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same
effect for both protocols.

My first approach was to create a new kind of route cache entries, to be
able to set DST_NOPOLICY without modifying routes. This would have added a
lot of code. Because the local delivery path is already handled, I choose
to focus on the forwarding path to minimize code churn.

Fixes: df789fe7 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccd27f05

25 6月, 2021 2 次提交

ipv6: delete useless dst check in ip6_dst_lookup_tail · c305b9e6

由 zhang kai 提交于 6月 24, 2021

parameter dst always points to null.
Signed-off-by: Nzhang kai <zhangkaiheb@126.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c305b9e6

net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81

由 Jakub Kicinski 提交于 6月 23, 2021

Dave observed number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has MTU of 64k kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse if
the message length is <32k the allocation may trigger an
OOM killer.

This is entirely avoidable, we can use an skb with page frags.

af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if underlying device supports SG, if extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.

v4: pre-calculate all the additions to alloclen so
    we can be sure it won't go over order-2
Reported-by: NDave Jones <dsj@fb.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d123b81

04 2月, 2021 1 次提交

net: use indirect call helpers for dst_output · 6585d7dc

由 Brian Vazquez 提交于 2月 01, 2021

This patch avoids the indirect call for the common case:
ip6_output and ip_output
Signed-off-by: NBrian Vazquez <brianvv@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

6585d7dc

10 1月, 2021 1 次提交

net: ipv6: Validate GSO SKB before finish IPv6 processing · b210de4f

由 Aya Levin 提交于 1月 07, 2021

There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface
   (virtio-net/vhost/tap) with TSO/GSO size set by other network
   stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
   interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is
   bridged to a NETIF_F_TSO tunnel stacked over an interface with an
   insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the
   MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e508490 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: NAya Levin <ayal@nvidia.com>
Reviewed-by: NTariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b210de4f

08 1月, 2021 2 次提交

skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put} · 8e044917

由 Jonathan Lemon 提交于 1月 06, 2021

Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.
Suggested-by: NWillem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

8e044917

skbuff: rename sock_zerocopy_* to msg_zerocopy_* · 8c793822

由 Jonathan Lemon 提交于 1月 06, 2021

At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.
Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

8c793822

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功