Commits · ceed9038b2783d14e0422bdc6fd04f70580efb4c · openeuler / Kernel

19 Jan, 2021 2 commits

ipv6: set multicast flag on the multicast route · ceed9038

Authored by Matteo Croce on Jan 15, 2021

The multicast route ff00::/8 is created with type RTN_UNICAST:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium

Set the type to RTN_MULTICAST which is more appropriate.

Fixes: e8478e80 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ceed9038

ipv6: create multicast route with RTPROT_KERNEL · a826b043

Authored by Matteo Croce on Jan 15, 2021

The ff00::/8 multicast route is created without specifying the fc_protocol
field, so the default RTPROT_BOOT value is used:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium

As the documentation says, this value identifies routes installed during
boot, but the route is created when the interface is set up.
Change the value to RTPROT_KERNEL, which is a better fit.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a826b043
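
Taken together, the two commits above change how the kernel-created ff00::/8 route is reported: type multicast instead of unicast, and proto kernel instead of boot. The sketch below only illustrates that expectation. The numeric constants are the values defined in include/uapi/linux/rtnetlink.h, while expected_route_attrs() is a hypothetical helper written for this example, not kernel code.

```python
import ipaddress

# Values as defined in include/uapi/linux/rtnetlink.h
RTN_UNICAST = 1
RTN_MULTICAST = 5
RTPROT_KERNEL = 2
RTPROT_BOOT = 3  # the old, misleading default for the ff00::/8 route

MULTICAST_PREFIX = ipaddress.ip_network("ff00::/8")


def expected_route_attrs(prefix):
    """Hypothetical helper: (type, protocol) a kernel-created IPv6 route should carry."""
    network = ipaddress.ip_network(prefix)
    if network.subnet_of(MULTICAST_PREFIX):
        route_type = RTN_MULTICAST
    else:
        route_type = RTN_UNICAST
    # Routes created when the interface is set up come from the kernel itself.
    return route_type, RTPROT_KERNEL
```

With both fixes applied, the ff00::/8 line in the `ip -6 -d route` output would be expected to start with `multicast` and carry `proto kernel`.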

15 Jan, 2021 1 commit

net: sit: unregister_netdevice on newlink's error path · 47e4bb14

Authored by Jakub Kicinski on Jan 13, 2021

We need to unregister the netdevice if config failed.
.ndo_uninit takes care of most of the heavy lifting.

This was uncovered by recent commit c269a24c ("net: make
free_netdev() more lenient with unregistering devices").
Previously the partially-initialized device would be left
in the system.

Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com
Fixes: e2f1f072 ("sit: allow to configure 6rd tunnels via netlink")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

47e4bb14

12 Jan, 2021 1 commit

esp: avoid unneeded kmap_atomic call · 9bd6b629

Authored by Willem de Bruijn on Jan 09, 2021

esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.

It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.

skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.

This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.

Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9bd6b629

10 Jan, 2021 1 commit

net: ipv6: Validate GSO SKB before finish IPv6 processing · b210de4f

Authored by Aya Levin on Jan 07, 2021

There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface
   (virtio-net/vhost/tap) with TSO/GSO size set by other network
   stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
   interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is
   bridged to a NETIF_F_TSO tunnel stacked over an interface with an
   insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the
   MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e508490 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b210de4f
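
The decision described in the commit above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not the kernel's implementation: finish_output_gso() and the dictionary it returns are hypothetical, and only the ICMPv6 type number (2, "Packet Too Big", RFC 4443) is a real protocol constant.

```python
ICMPV6_PKT_TOOBIG = 2  # ICMPv6 "Packet Too Big" message type (RFC 4443)


def finish_output_gso(segment_lengths, egress_mtu):
    """Hypothetical model of the output-finish decision for a GSO packet."""
    if all(length <= egress_mtu for length in segment_lengths):
        # Every segment fits the egress MTU: transmit as usual.
        return {"action": "transmit"}
    # At least one segment would exceed the MTU: consume the packet and
    # report the MTU so the source can lower its Path MTU.
    return {
        "action": "drop",
        "icmpv6_type": ICMPV6_PKT_TOOBIG,
        "mtu": egress_mtu,
    }
```

For example, segments of 1500 bytes leaving through a 1280-byte MTU interface would be dropped in this model, and the reply would advertise an MTU of 1280.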

08 Jan, 2021 1 commit

net: ipv6: fib: flush exceptions when purging route · d8f5c296

Authored by Sean Tranchetti on Jan 05, 2021

Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d8f5c296

18 Dec, 2020 1 commit

netfilter: x_tables: Update remaining dereference to RCU · 443d6e86

Authored by Subash Abhinov Kasiviswanathan on Dec 16, 2020

This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: cc00bcaa ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

443d6e86

10 Dec, 2020 1 commit

tcp: Retain ECT bits for tos reflection · 8ef44b6f

Authored by Wei Wang on Dec 08, 2020

For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.

Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ef44b6f
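
The bit manipulation this commit describes can be sketched as below: the DSCP part of the SYN's TOS is reflected, while the two ECN bits are taken from the socket. INET_ECN_MASK (0x3) matches the kernel's definition; synack_tos() and its parameters are hypothetical names chosen for this illustration.

```python
INET_ECN_MASK = 0x3  # the two low-order ECN bits of the TOS/traffic-class byte


def synack_tos(syn_tos, socket_tos, reflect_tos):
    """Hypothetical helper: TOS byte to place in the SYN-ACK."""
    if not reflect_tos:
        # Reflection disabled: use the socket's own TOS unchanged.
        return socket_tos
    # Reflect the upper six bits from the SYN, but keep the ECT bits that
    # the congestion control algorithm set on the socket.
    reflected_dscp = syn_tos & ~INET_ECN_MASK & 0xFF
    return reflected_dscp | (socket_tos & INET_ECN_MASK)
```

For instance, a SYN carrying TOS 0xB9 and a socket whose TOS has ECT(0) set (0x02) would yield 0xBA in this model.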

09 Dec, 2020 1 commit

net: ipv6: rpl_iptunnel: simplify the return expression of rpl_do_srh() · 9faad250

Authored by Zheng Yongjun on Dec 08, 2020

Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9faad250

08 Dec, 2020 1 commit

netfilter: x_tables: Switch synchronization to RCU · cc00bcaa

Authored by Subash Abhinov Kasiviswanathan on Nov 25, 2020

When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.

The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.

However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If the CPU decides to execute the assignment later, as there is no user
of the table information prior to the sequence check, the packet path on another
CPU may use the old table information. The replacement thread would then free
the table information under it, leading to a use-after-free in the packet
processing context:

Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c

This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.

Fixes: 80055dab ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

cc00bcaa
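
The odd/even convention described in the commit above can be modelled with a toy, single-threaded sketch. It follows the commit message's description of the handshake, not the kernel source, and it does not reproduce the reordering race itself. PerCpuSeqCount and old_table_can_be_freed() are hypothetical names.

```python
class PerCpuSeqCount:
    """Toy model of one CPU's sequence counter."""

    def __init__(self):
        self.sequence = 0

    def packet_enter(self):
        # Entering the packet path makes the counter odd.
        self.sequence += 1

    def packet_exit(self):
        # Leaving the packet path makes the counter even again.
        self.sequence += 1

    def in_packet_path(self):
        return self.sequence % 2 == 1


def old_table_can_be_freed(per_cpu_counts):
    """Replacement side: the old table may go only when no CPU shows an odd count."""
    return not any(count.in_packet_path() for count in per_cpu_counts)
```

The scheme is only sound if the new table pointer is actually published before the counters are sampled, which is the ordering assumption the commit message says can be violated.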

05 Dec, 2020 5 commits

seg6: add VRF support for SRv6 End.DT6 behavior · 20a081b7

Authored by Andrea Mayer on Dec 02, 2020

SRv6 End.DT6 is defined in the SRv6 Network Programming [1].

The Linux kernel already offers an implementation of the SRv6
End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This
implementation is not particularly suitable in contexts where we need to
deploy IPv6 L3 VPNs among different tenants which share the same network
address schemes. The underlying problem lies in the fact that the
current version of DT6 (called legacy DT6 from now on) needs a complex
configuration to be applied on routers which requires ad-hoc routes and
routing policy rules to ensure the correct isolation of tenants.

Consequently, a new implementation of DT6 has been introduced with the
aim of simplifying the construction of IPv6 L3 VPN services in the
multi-tenant environment using SRv6 networks. To accomplish this task,
we reused the same VRF infrastructure and SRv6 core components already
exploited for implementing the SRv6 End.DT4 behavior.

Currently the two End.DT6 implementations coexist seamlessly and can be
used depending on the context and the user preferences. So, in order to
support both versions of DT6 a new attribute (vrftable) has been
introduced which allows us to differentiate the implementation of the
behavior to be used.

A SRv6 End.DT6 legacy behavior is still instantiated using a command
like the following one:

$ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0

To instantiate the SRv6 End.DT6 in VRF mode, the command is still
pretty straightforward:

$ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0

Obviously as in the case of SRv6 End.DT4, the VRF strict_mode parameter
must be set (net.vrf.strict_mode=1) and the VRF associated with table
100 must exist.

Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF
mode can coexist in the same system/configuration without problems.

[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

20a081b7

seg6: add support for the SRv6 End.DT4 behavior · 664d6f86

Authored by Andrea Mayer on Dec 02, 2020

SRv6 End.DT4 is defined in the SRv6 Network Programming [1].

The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in
multi-tenant environments. It decapsulates the received packets and
performs an IPv4 routing lookup in the routing table of the tenant.

The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
force the routing lookup into the associated routing table.

To make the End.DT4 work properly, it must be guaranteed that the routing
table used for routing lookup operations is bound to one and only one
VRF during the tunnel creation. Such a constraint has to be enforced by
enabling the VRF strict_mode sysctl parameter, i.e.:
 $ sysctl -wq net.vrf.strict_mode=1

At JANOG44, LINE corporation presented their multi-tenant DC architecture
using SRv6 [2]. In the slides, they reported that the Linux kernel is
missing the support of SRv6 End.DT4 behavior.

The SRv6 End.DT4 behavior can be instantiated using a command similar to
the following:

 $ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0

We introduce the "vrftable" extension in iproute2 in a following patch.

[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
[2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

664d6f86

seg6: add callbacks for customizing the creation/destruction of a behavior · cfdf64a0

Authored by Andrea Mayer on Dec 02, 2020

We introduce two callbacks used for customizing the creation/destruction of
a SRv6 behavior. Such callbacks are defined in the new struct
seg6_local_lwtunnel_ops and hereafter we provide a brief description of
them:

 - build_state(...): used for calling the custom constructor of the
   behavior during its initialization phase and after all the attributes
   have been parsed successfully;

 - destroy_state(...): used for calling the custom destructor of the
   behavior before it is completely destroyed.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cfdf64a0
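
The pair of callbacks described above follows a common optional-hook pattern, sketched here in Python. The class and function names are hypothetical stand-ins (the kernel uses a C struct, seg6_local_lwtunnel_ops); the sketch assumes build_state returns 0 on success and a nonzero error otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Seg6LocalOps:
    """Hypothetical stand-in for the per-behavior callback table."""
    # Both hooks are optional; a behavior with no custom state sets neither.
    build_state: Optional[Callable[[dict], int]] = None
    destroy_state: Optional[Callable[[dict], None]] = None


def create_behavior(ops, attrs):
    # Copying attrs stands in for "all attributes parsed successfully".
    state = dict(attrs)
    if ops.build_state is not None:
        err = ops.build_state(state)
        if err:
            return None, err
    return state, 0


def destroy_behavior(ops, state):
    # Run the custom destructor before the state itself goes away.
    if ops.destroy_state is not None:
        ops.destroy_state(state)
    state.clear()
```

Keeping the hooks optional means existing behaviors need no changes, while a behavior that needs extra setup can provide only the hooks it needs.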

seg6: add support for optional attributes in SRv6 behaviors · 0a3021f1

Authored by Andrea Mayer on Dec 02, 2020

Before this patch, each SRv6 behavior specifies a set of required
attributes that must be provided by the userspace application when such
behavior is going to be instantiated. If at least one of the required
attributes is not provided, the creation of the behavior fails.

The SRv6 behavior framework lacks a way to manage optional attributes.
By definition, an optional attribute for a SRv6 behavior consists of an
attribute which may or may not be provided by the userspace. Therefore,
if an optional attribute is missing (and thus not supplied by the user)
the creation of the behavior goes ahead without any issue.

This patch explicitly differentiates the required attributes from the
optional attributes. In particular, each behavior can declare a set of
required attributes and a set of optional ones.

The semantic of the required attributes remains *totally* unaffected by
this patch. The introduction of the optional attributes does NOT impact
on the backward compatibility of the existing SRv6 behaviors.

It is essential to note that if an (optional or required) attribute is
supplied to a SRv6 behavior which does not expect it, the behavior
simply discards such attribute without generating any error or warning.
This operating mode remained unchanged both before and after the
introduction of the optional attributes extension.

The optional attributes are one of the key components used to implement
the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding
(VRF) framework. The optional attributes make possible the coexistence
of the already existing SRv6 End.DT6 implementation with the new SRv6
End.DT6 VRF-based implementation without breaking any backward
compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode)
are reported in subsequent patches.

From the userspace point of view, the support for optional attributes does
NOT require any changes to the userspace applications, i.e. iproute2,
unless new attributes (required or optional) are needed.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0a3021f1
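
The rules spelled out in this commit message (required attributes must be present, optional ones may be absent, unexpected ones are silently discarded) can be sketched as follows. parse_behavior_attrs() and the attribute names used in the usage note are illustrative assumptions, not the kernel's API.

```python
def parse_behavior_attrs(required, optional, supplied):
    """Hypothetical model of attribute handling for one SRv6 behavior.

    required, optional: sets of attribute names the behavior declares.
    supplied: dict of attribute name -> value provided by userspace.
    """
    missing = required - set(supplied)
    if missing:
        # A missing required attribute makes the creation fail.
        raise ValueError("missing required attributes: %s" % sorted(missing))
    accepted = required | optional
    # Attributes the behavior does not expect are dropped without any error.
    return {name: value for name, value in supplied.items() if name in accepted}
```

In this model, a behavior declaring required={"action"} and optional={"vrftable"} accepts input with or without "vrftable", and quietly ignores an unrelated attribute such as "oif".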

seg6: improve management of behavior attributes · 964adce5

Authored by Andrea Mayer on Dec 02, 2020

Depending on the attribute (e.g. SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc.),
the parse() callback performs some validity checks on the provided input
and updates the tunnel state (slwt) with the result of the parsing
operation. However, an attribute may also need to reserve some additional
resources (e.g. allocating memory or setting up an eBPF program) in the
parse() callback to complete the parsing operation.

The parse() callbacks are invoked by the parse_nla_action() for each
attribute belonging to a specific behavior. Given a behavior with N
attributes, if the parsing of the i-th attribute fails, the
parse_nla_action() returns immediately with an error. Nonetheless, the
resources acquired during the parsing of the first i-1 attributes are not
freed by the parse_nla_action().

Attributes which acquire resources must release them *in an explicit way*
in both the seg6_local_{build/destroy}_state(). However, adding a new
attribute of this type requires changes to
seg6_local_{build/destroy}_state() to release the resources correctly.

The seg6local infrastructure still lacks a simple and structured way to
release the resources acquired in the parse() operations.

We introduced a new callback in the struct seg6_action_param named
destroy(). This callback releases any resource which may have been acquired
in the parse() counterpart. Each attribute may or may not implement the
destroy() callback depending on whether it needs to free some acquired
resources.

The destroy() callback comes with several advantages:

 1) we can have as many attributes as we want for a given behavior with no
    need to explicitly free the taken resources;

 2) As in case of the seg6_local_build_state(), the
    seg6_local_destroy_state() does not need to handle the release of
    resources directly. Indeed, it calls the destroy_attrs() function which
    is in charge of calling the destroy() callback for every set attribute.
    We do not need to patch seg6_local_{build/destroy}_state() anymore as
    we add new attributes;

 3) the code is more readable and better structured. Indeed, all the
    information needed to handle a given attribute is contained in only
    one place;

 4) it facilitates the integration with new features introduced in further
    patches.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

964adce5
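
The parse()/destroy() pairing described above amounts to rolling back already-parsed attributes when a later one fails. A minimal sketch follows, assuming each attribute is described by a small object holding its callbacks; AttrParam and the Python versions of parse_nla_action() and destroy_attrs() are hypothetical simplifications of the C functions named in the commit message.

```python
EINVAL = 22


class AttrParam:
    """Hypothetical per-attribute descriptor: parse() plus optional destroy()."""

    def __init__(self, parse, destroy=None):
        self.parse = parse
        self.destroy = destroy


def destroy_attrs(params, parsed_names, state):
    # Call destroy() for every attribute that was set and that defines one.
    for name in parsed_names:
        destroy = params[name].destroy
        if destroy is not None:
            destroy(state)


def parse_nla_action(params, supplied, state):
    parsed = []
    for name, value in supplied.items():
        try:
            params[name].parse(value, state)
        except ValueError:
            # Parsing attribute i failed: release what the earlier ones acquired.
            destroy_attrs(params, parsed, state)
            return -EINVAL
        parsed.append(name)
    return 0
```

Because each attribute carries its own cleanup, adding a new resource-holding attribute in this model does not require touching the generic build or destroy paths.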

04 Dec, 2020 1 commit

tcp: merge 'init_req' and 'route_req' functions · 7ea851d1

Authored by Florian Westphal on Nov 30, 2020

The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this: the 3whs (three-way handshake) completes and
the 'new subflow' is reset afterwards.  There are two ways to allow MPTCP to
send the reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in the next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7ea851d1

03 Dec, 2020 2 commits

bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks · 427167c0

Authored by Stanislav Fomichev on Dec 02, 2020

I have to now lock/unlock socket for the bind hook execution.
That shouldn't cause any overhead because the socket is unbound
and shouldn't receive any traffic.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com

427167c0

net: ip6_gre: set dev->hard_header_len when using header_ops · 832ba596

Authored by Antoine Tenart on Nov 30, 2020

syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
could reproduce it creating an NBMA ip6gre interface and forwarding
traffic to it:

  skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
  ------------[ cut here ]------------
  kernel BUG at net/core/skbuff.c:109!
  Call Trace:
  skb_push+0x10/0x10
  ip6gre_header+0x47/0x1b0
  neigh_connected_output+0xae/0xf0

ip6gre tunnel provides its own header_ops->create, and sets it
conditionally when initializing the tunnel in NBMA mode. When
header_ops->create is used, dev->hard_header_len should reflect the
length of the header created. Otherwise, when not used,
dev->needed_headroom should be used.

Fixes: eb95f52f ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
Cc: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

832ba596

02 Dec, 2020 1 commit

net/ipv6: propagate user pointer annotation · 9e39394f

Authored by Lukas Bulwahn on Nov 27, 2020

For IPV6_2292PKTOPTIONS, do_ipv6_getsockopt() stores the user pointer
optval in the msg_control field of the msghdr.

Hence, sparse rightfully warns at ./net/ipv6/ipv6_sockglue.c:1151:33:

  warning: incorrect type in assignment (different address spaces)
      expected void *msg_control
      got char [noderef] __user *optval

Since commit 1f466e1f ("net: cleanly handle kernel vs user buffers for
->msg_control"), user pointers shall be stored in the msg_control_user
field, and kernel pointers in the msg_control field. This allows __user
annotations to be propagated cleanly through this struct.

Store optval in msg_control_user to properly record and propagate the
memory space annotation of this pointer.

Note that msg_control_is_user is set to true, so the key invariant, i.e.,
use msg_control_user if and only if msg_control_is_user is true, holds.

The msghdr is further used in the six alternative put_cmsg() calls. With
msg_control_is_user being true, put_cmsg() picks msg_control_user,
preserving the __user annotation, and passes it properly to
copy_to_user().

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20201127093421.21673-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9e39394f

01 Dec, 2020 1 commit

netfilter: use actual socket sk for REJECT action · 04295878

Authored by Jan Engelhardt on Nov 21, 2020

True to the message of commit v5.10-rc1-105-g46d6c5ae, _do_
actually make use of state->sk when possible, such as in the REJECT
modules.

Reported-by: Minqiang Chen <ptpt52@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

04295878

26 Nov, 2020 1 commit

ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init · e255e11e

Authored by Wang Hai on Nov 24, 2020

kmemleak reports a memory leak as follows:

unreferenced object 0xffff8880059c6a00 (size 64):
  comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
  hex dump (first 32 bytes):
    20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00   ...............
    1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00  ................
  backtrace:
    [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0
    [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170
    [<000000006a9ca9d4>] ops_init+0xa8/0x3c0
    [<000000002da57bf2>] setup_net+0x2de/0x7e0
    [<000000004e52d573>] copy_net_ns+0x27d/0x530
    [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30
    [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0
    [<0000000030653721>] ksys_unshare+0x3a4/0x780
    [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40
    [<0000000031a10c08>] do_syscall_64+0x33/0x40
    [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

We should free all rules when we catch an error in ip6addrlbl_net_init();
otherwise a memory leak will occur.

Fixes: 2a8cc6c8 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e255e11e
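
The error-path rule stated in this commit (free every rule already added when initialization fails part-way) can be sketched generically. init_label_table() and the add_rule callback are hypothetical; the sketch only models the cleanup pattern, not the kernel's data structures.

```python
def init_label_table(default_rules, add_rule):
    """Hypothetical model of a per-namespace init that adds default rules.

    add_rule(table, rule) returns 0 on success or a negative error code.
    """
    table = []
    for rule in default_rules:
        err = add_rule(table, rule)
        if err:
            # Error path: release everything added so far instead of
            # leaving the earlier rules behind.
            table.clear()
            return err, table
    return 0, table
```

The leak in the report corresponds to returning the error without the cleanup step, which leaves the earlier rules allocated with nothing referencing them.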

25 Nov, 2020 1 commit

tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7

Authored by Alexander Duyck on Nov 20, 2020

When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

407c85c7
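
The last step of the fix above (using the result of the extra check to set ECT0 in the SYN-ACK's tos/tclass) reduces to a small bit operation. In this sketch the constants match the kernel's ECN codepoints, while synack_tos_with_ecn() and its boolean parameter, which stands in for the result of tcp_bpf_ca_needs_ecn(), are assumptions made for illustration; the kernel's exact capability check may differ.

```python
INET_ECN_MASK = 0x3   # both ECN bits
INET_ECN_ECT_0 = 0x2  # ECT(0) codepoint


def synack_tos_with_ecn(tos, ca_needs_ecn):
    """Hypothetical helper: add ECT(0) when the chosen congestion control needs ECN."""
    already_marked = (tos & INET_ECN_MASK) != 0
    if ca_needs_ecn and not already_marked:
        tos |= INET_ECN_ECT_0
    return tos
```

With this in place, a SYN-ACK for a connection that BPF switched to an ECN-based algorithm carries ECT(0) like the rest of the flow's frames.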

24 Nov, 2020 2 commits

tcp: fix race condition when creating child sockets from syncookies · 01770a16

Authored by Ricardo Dias on Nov 20, 2020

When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such an event happens, the TCP stack code has a race condition that
occurs between the moment a lookup is done in the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets for the same
client to the userspace program.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections hashtable. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

01770a16
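
The essence of the fix above is to make "look up" and "insert" a single step, so that the second of two near-simultaneous packets finds the first child socket instead of adding its own. A minimal sketch of that idea follows, using a lock-protected dictionary as a stand-in for the established connections hashtable; EstablishedTable and its method are hypothetical.

```python
import threading


class EstablishedTable:
    """Toy stand-in for the established connections hashtable."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sockets = {}

    def insert_unless_present(self, four_tuple, child):
        """Insert child for four_tuple unless one already exists.

        Returns (socket_in_table, inserted).
        """
        with self._lock:
            existing = self._sockets.get(four_tuple)
            if existing is not None:
                # Another packet already created the child: the caller should
                # drop its packet and discard the duplicate child socket.
                return existing, False
            self._sockets[four_tuple] = child
            return child, True
```

Doing the check at insertion time, rather than only at the earlier lookup, closes the window in which both packets see an empty table.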

lsm,selinux: pass flowi_common instead of flowi to the LSM hooks · 3df98d79

Authored by Paul Moore on Sep 27, 2020

As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely.  As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>

3df98d79

21 Nov, 2020 1 commit

tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5

Authored by Alexander Duyck on Nov 19, 2020

An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone through and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. With this change, the ECT bits
were restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

861602b5

20 Nov, 2020 3 commits

IPv6: RTM_GETROUTE: Add RTA_ENCAP to result · 6b13d8f7

Authored by Oliver Herms on Nov 19, 2020

This patch adds an IPv6 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(e.g. ip route get 2001:db8::).

Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201118230651.GA8861@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6b13d8f7

crypto: sha - split sha.h into sha1.h and sha2.h · a24d22b2

Authored by Eric Biggers on Nov 12, 2020

Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.

This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure.  So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.

Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.

This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1.  It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

a24d22b2

ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module · 2d8f6481

Authored by Georg Kohmann on Nov 19, 2020

IPV6=m
NF_DEFRAG_IPV6=y

ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
`nf_ct_frag6_gather':
net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
`ipv6_frag_thdr_truncated'

Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
dependency is forcing IPV6=y.

Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
is the same solution as used for a similar issue; see
commit 70b095c8 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
module")

Fixes: 9d9e937b ("ipv6/netfilter: Discard first fragment not including all headers")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2d8f6481

19 Nov, 2020 1 commit

ah6: fix error return code in ah6_input() · a5ebcbdf

Committed by Zhang Changzhong on Nov 17, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a5ebcbdf
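
The pattern this fix addresses can be sketched in plain C. This is a minimal userspace illustration, not the code of ah6_input(): the function names, the length check, and the choice of -EINVAL are all hypothetical. It only shows how an error branch that jumps to the exit label without assigning err ends up returning 0.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical validation step (not a kernel API):
 * nonzero when the input is long enough to process. */
static int length_ok(size_t len, size_t min_len)
{
    return len >= min_len;
}

/* Buggy shape: err is still 0 when the failure branch jumps to the
 * exit label, so the caller sees "success". */
static int input_buggy(size_t len)
{
    int err = 0;

    if (!length_ok(len, 8))
        goto out;               /* err was never set */

    /* ... normal processing would happen here ... */
out:
    return err;
}

/* Fixed shape: assign a negative error code before leaving. */
static int input_fixed(size_t len)
{
    int err = 0;

    if (!length_ok(len, 8)) {
        err = -EINVAL;          /* assumed code, for illustration only */
        goto out;
    }

    /* ... normal processing would happen here ... */
out:
    return err;
}
```

With these definitions, input_buggy(4) reports success although the check failed, while input_fixed(4) returns a negative errno value.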

17 Nov, 2020 1 commit

ipv6/netfilter: Discard first fragment not including all headers · 9d9e937b

Committed by Georg Kohmann on Nov 11, 2020

Packets are processed even though the first fragment doesn't include all
headers through the upper layer header. This breaks TAHI IPv6 Core
Conformance Test v6LC.1.3.6.

Referring to RFC 8200, Section 4.5: "If the first fragment does not include
all headers through an Upper-Layer header, then that fragment should be
discarded and an ICMP Parameter Problem, Code 3, message should be sent to
the source of the fragment, with the Pointer field set to zero."

The fragment needs to be validated the same way it is done in
commit 2efdaaaf ("IPv6: reply ICMP error if the first fragment don't
include all headers") for ipv6. Wrap the validation into a common function,
ipv6_frag_thdr_truncated() to check for truncation in the upper layer
header. This validation does not fulfill all aspects of RFC 8200,
section 4.5, but is at the moment sufficient to pass the mentioned TAHI test.

In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
let ipv6_frag_thdr_truncated() start its traversal from the fragment
header.

Return 0 to drop the fragment in the netfilter. This is the same behaviour
as used on other protocol errors in this function, e.g. when
nf_ct_frag6_queue() returns -EPROTO. The fragment will later be picked up
by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
appropriate ICMP Parameter Problem message back to the source.

References commit 2efdaaaf ("IPv6: reply ICMP error if the first
fragment don't include all headers")

Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9d9e937b
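
The truncation check described in this commit can be illustrated with a small userspace sketch. It is not the kernel's ipv6_frag_thdr_truncated(): it assumes the extension-header chain has already been walked, so the offset and protocol number of the upper-layer header are known, and the names and minimum sizes below (TCP 20 bytes, UDP 8 bytes, ICMPv6 4 bytes, 1 byte otherwise) are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers for the upper-layer headers handled here. */
enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_ICMPV6 = 58 };

/* Minimum number of bytes the first fragment must carry for the given
 * upper-layer protocol (illustrative values, see the text above). */
static size_t min_upper_hdr_len(uint8_t nexthdr)
{
    switch (nexthdr) {
    case PROTO_TCP:
        return 20;  /* TCP header without options */
    case PROTO_UDP:
        return 8;   /* UDP header */
    case PROTO_ICMPV6:
        return 4;   /* type, code, checksum */
    default:
        return 1;   /* unknown protocol: require at least one byte */
    }
}

/* True when a first fragment of pkt_len bytes ends before the
 * upper-layer header starting at upper_off is complete. */
static bool first_frag_truncated(size_t upper_off, uint8_t nexthdr,
                                 size_t pkt_len)
{
    return upper_off + min_upper_hdr_len(nexthdr) > pkt_len;
}
```

For example, with the upper-layer header at offset 48 in a 60-byte first fragment, a TCP header (needing bytes 48 to 67) is truncated, whereas a UDP header (bytes 48 to 55) is complete.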

15 Nov, 2020 2 commits

inet: unexport udp{4|6}_lib_lookup_skb() · 508c4fc2

Committed by Eric Dumazet on Nov 13, 2020

These functions do not need to be exported.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201113113553.3411756-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

508c4fc2

ipv6: remove unused function ipv6_skb_idev() · 2e793878

Committed by Lukas Bulwahn on Nov 13, 2020

Commit bdb7cc64 ("ipv6: Count interface receive statistics on the
ingress netdev") removed all callees for ipv6_skb_idev(). Hence, since
then, ipv6_skb_idev() is unused and make CC=clang W=1 warns:

net/ipv6/exthdrs.c:909:33:
warning: unused function 'ipv6_skb_idev' [-Wunused-function]

So, remove this unused function and a -Wunused-function warning.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lore.kernel.org/r/20201113135012.32499-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2e793878

14 Nov, 2020 2 commits

ipv6: Fix error path to cancel the meseage · ceb736e1

Committed by Zhang Qilong on Nov 12, 2020

genlmsg_cancel() needs to be called in the error path of
inet6_fill_ifmcaddr and inet6_fill_ifacaddr to cancel
the message.

Fixes: 6ecf4c37 ("ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Link: https://lore.kernel.org/r/20201112080950.1476302-1-zhangqilong3@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ceb736e1

net: Exempt multicast addresses from five-second neighbor lifetime · 8cf8821e

Committed by Jeff Dike on Nov 12, 2020

Commit 58956317 ("neighbor: Improve garbage collection")
guarantees neighbour table entries a five-second lifetime.  Processes
which make heavy use of multicast can fill the neighbour table with
multicast addresses in five seconds.  At that point, neighbour entries
can't be GC-ed because they aren't five seconds old yet, the kernel
log starts to fill up with "neighbor table overflow!" messages, and
sends start to fail.

This patch allows multicast addresses to be thrown out before they've
lived out their five seconds.  This makes room for non-multicast
addresses and makes messages to all addresses more reliable in these
circumstances.

Fixes: 58956317 ("neighbor: Improve garbage collection")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8cf8821e
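
The eviction rule described above can be sketched as a small predicate. This is a simplified userspace model, not the kernel's neighbour garbage collector: the struct, its fields, and the millisecond clock are hypothetical, and only the five-second minimum lifetime and the multicast exemption come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_LIFETIME_MS 5000u   /* five-second guaranteed lifetime */

/* Hypothetical, simplified neighbour entry. */
struct neigh_sketch {
    uint64_t updated_ms;    /* time of last update, in milliseconds */
    bool is_multicast;      /* entry refers to a multicast address */
};

/* Before the change: only entries older than five seconds may go. */
static bool can_evict_old(const struct neigh_sketch *n, uint64_t now_ms)
{
    return now_ms - n->updated_ms >= MIN_LIFETIME_MS;
}

/* After the change: multicast entries may be evicted at any age. */
static bool can_evict_new(const struct neigh_sketch *n, uint64_t now_ms)
{
    if (n->is_multicast)
        return true;
    return can_evict_old(n, now_ms);
}
```

Under this model, a one-second-old multicast entry cannot be reclaimed by the old rule but can be by the new one, while a one-second-old unicast entry stays protected in both cases.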

13 Nov, 2020 1 commit

net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO · 55e72988

Committed by Alexander Lobakin on Nov 11, 2020

udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
packet. While it's probably OK for non-frag0 paths, these helpers
will also point to junk on Fast/frag0 GRO when all headers are
located in frags. As a result, sk/skb lookup may fail or give wrong
results. To support both GRO modes, skb_gro_network_header() might
be used. To not modify original functions, add private versions of
udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

Present since the introduction of "application-level" UDP GRO
in 4.7-rc1.

Misc: replace totally unneeded ternaries with plain ifs.

Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
Suggested-by: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

55e72988

11 Nov, 2020 2 commits

inet: udp{4|6}_lib_lookup_skb() skb argument is const · 7b58e63e

Committed by Eric Dumazet on Nov 09, 2020

The skb is needed only to fetch the keys for the lookup.

Both functions are used from GRO stack, we do not want
accidental modification of the skb.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7b58e63e

net: Update window_clamp if SOCK_RCVBUF is set · 909172a1

Committed by Mao Wenan on Nov 10, 2020

When net.ipv4.tcp_syncookies=1 and a SYN flood occurs,
cookie_v4_check or cookie_v6_check tries to redo what
tcp_v4_send_synack or tcp_v6_send_synack did.
If SOCK_RCVBUF is set, rsk_window_clamp is changed,
which makes rcv_wscale differ. The client
still operates with the initial window scale and can overshoot
the granted window: the client uses the initial scale while the local
server uses the new scale to advertise window values, and the session
behaves abnormally.

Fixes: e88c64f0 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Mao Wenan <wenan.mao@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

909172a1
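
The effect of a window-scale mismatch can be shown with a short arithmetic sketch. The helper names and the concrete numbers are hypothetical and this is not kernel code; it only models the fact that the 16-bit TCP window field carries the window shifted right by the sender's scale, and that the receiver of the segment shifts it back left by the scale it believes is in use.

```c
#include <stdint.h>

/* Value placed in the 16-bit window field for a window of
 * window_bytes when the advertising side uses scale wscale. */
static uint16_t advertise(uint32_t window_bytes, unsigned int wscale)
{
    uint32_t field = window_bytes >> wscale;

    return field > 0xFFFFu ? 0xFFFFu : (uint16_t)field;
}

/* Window, in bytes, that the peer derives from the field when it
 * assumes scale wscale. */
static uint32_t interpret(uint16_t field, unsigned int wscale)
{
    return (uint32_t)field << wscale;
}
```

If both sides use scale 7, a 65536-byte window round-trips unchanged. If the server advertises with a recomputed scale of 2 while the client still applies the initial scale of 7, the field 16384 is read as 2097152 bytes, 32 times the window actually granted, which is the overshoot described in the commit message.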

10 Nov, 2020 3 commits

ipv4/ipv6: switch to dev_get_tstats64 · 98d7fc46

Committed by Heiner Kallweit on Nov 07, 2020

Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

98d7fc46

vti: switch to dev_get_tstats64 · 8f3feb24

Committed by Heiner Kallweit on Nov 07, 2020

Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8f3feb24

ip6_tunnel: use ip_tunnel_get_stats64 as ndo_get_stats64 callback · 6b840a04

Committed by Heiner Kallweit on Nov 07, 2020

Switch ip6_tunnel to the standard statistics pattern:
- use dev->stats for the less frequently accessed counters
- use dev->tstats for the frequently accessed counters

An additional benefit is that we now have 64bit statistics also on
32bit systems.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6b840a04

openeuler / Kernel synced successfully nearly 2 years ago