提交 · c457985aaa92e1fda2ce837cabf90bf687b92dcb · openeuler / Kernel

19 8月, 2022 3 次提交

tcp: fix tcp_cleanup_rbuf() for tcp_read_skb() · c457985a

由 Cong Wang 提交于 8月 17, 2022

tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it
assumes the skb is not yet dequeued. This is no longer true for
tcp_read_skb() case where we dequeue the skb first.

Fix this by introducing a helper __tcp_cleanup_rbuf() which does
not require any skb and calling it in tcp_read_skb().

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

c457985a

tcp: fix sock skb accounting in tcp_read_skb() · e9c6e797

由 Cong Wang 提交于 8月 17, 2022

Before commit 965b57b4 ("net: Introduce a new proto_ops
->read_skb()"), skb was not dequeued from receive queue hence
when we close TCP socket skb can be just flushed synchronously.

After this commit, we have to uncharge skb immediately after being
dequeued, otherwise it is still charged in the original sock. And we
still need to retain skb->sk, as eBPF programs may extract sock
information from skb->sk. Therefore, we have to call
skb_set_owner_sk_safe() here.

Fixes: 965b57b4 ("net: Introduce a new proto_ops ->read_skb()")
Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
Tested-by: NStanislav Fomichev <sdf@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

e9c6e797

net: genl: fix error path memory leak in policy dumping · 24980136

由 Jakub Kicinski 提交于 8月 16, 2022

If construction of the array of policies fails when recording
non-first policy we need to unwind.

netlink_policy_dump_add_policy() itself also needs fixing as
it currently gives up on error without recording the allocated
pointer in the pstate pointer.

Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com
Fixes: 50a896cf ("genetlink: properly support per-op policy dumping")
Link: https://lore.kernel.org/r/20220816161939.577583-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

24980136

18 8月, 2022 1 次提交

net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it · 211987f3

由 Vladimir Oltean 提交于 8月 16, 2022

ds->ops->port_stp_state_set() is, like most DSA methods, optional, and
if absent, the port is supposed to remain in the forwarding state (as
standalone). Such is the case with the mv88e6060 driver, which does not
offload the bridge layer. DSA warns that the STP state can't be changed
to FORWARDING as part of dsa_port_enable_rt(), when in fact it should not.

The error message is also not up to modern standards, so take the
opportunity to make it more descriptive.

Fixes: fd364541 ("net: dsa: change scope of STP state setter")
Reported-by: NSergei Antonov <saproj@gmail.com>
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NSergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220816201445.1809483-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

211987f3

17 8月, 2022 3 次提交

tls: rx: react to strparser initialization errors · 849f16bb

由 Jakub Kicinski 提交于 8月 15, 2022

Even though the normal strparser's init function has a return
value we got away with ignoring errors until now, as it only
validates the parameters and we were passing correct parameters.

tls_strp can fail to init on memory allocation errors, which
syzbot duly induced and reported.

Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com
Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

849f16bb

netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y · aa5762c3

由 Geert Uytterhoeven 提交于 8月 15, 2022

NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca
("netfilter: provide config option to disable ancient procfs parts") in
v3.3.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NFlorian Westphal <fw@strlen.de>

aa5762c3

net: sched: fix misuse of qcpu->backlog in gnet_stats_add_queue_cpu · de64b6b6

由 Zhengchao Shao 提交于 8月 15, 2022

In the gnet_stats_add_queue_cpu function, the qstats->qlen statistics
are incorrectly set to qcpu->backlog.

Fixes: 448e163f ("gen_stats: Add gnet_stats_add_queue()")
Signed-off-by: NZhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815030848.276746-1-shaozhengchao@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

de64b6b6

16 8月, 2022 2 次提交

net: rtnetlink: fix module reference count leak issue in rtnetlink_rcv_msg · 5b22f627

由 Zhengchao Shao 提交于 8月 15, 2022

When bulk delete command is received in the rtnetlink_rcv_msg function,
if bulk delete is not supported, module_put is not called to release
the reference counting. As a result, module reference count is leaked.

Fixes: a6cec0bc ("net: rtnetlink: add bulk delete support flag")
Signed-off-by: NZhengchao Shao <shaozhengchao@huawei.com>
Acked-by: NNikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220815024629.240367-1-shaozhengchao@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

5b22f627

netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified · 1b6345d4

由 Pablo Neira Ayuso 提交于 8月 15, 2022

Since f3a2181e ("netfilter: nf_tables: Support for sets with
multiple ranged fields"), it possible to combine intervals and
concatenations. Later on, ef516e86 ("netfilter: nf_tables:
reintroduce the NFT_SET_CONCAT flag") provides the NFT_SET_CONCAT flag
for userspace to report that the set stores a concatenation.

Make sure NFT_SET_CONCAT is set on if field_count is specified for
consistency. Otherwise, if NFT_SET_CONCAT is specified with no
field_count, bail out with EINVAL.

Fixes: ef516e86 ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

1b6345d4

15 8月, 2022 7 次提交

netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END · fc0ae524

由 Pablo Neira Ayuso 提交于 8月 13, 2022

These flags are mutually exclusive, report EINVAL in this case.

Fixes: aaa31047 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

fc0ae524

netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags · 88cccd90

由 Pablo Neira Ayuso 提交于 8月 12, 2022

If the NFT_SET_CONCAT|NFT_SET_INTERVAL flags are set on, then the
netlink attribute NFTA_SET_ELEM_KEY_END must be specified. Otherwise,
NFTA_SET_ELEM_KEY_END should not be present.

For catch-all element, NFTA_SET_ELEM_KEY_END should not be present.
The NFT_SET_ELEM_INTERVAL_END is never used with this set flags
combination.

Fixes: 7b225d0b ("netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

88cccd90

net_sched: cls_route: disallow handle of 0 · 02799571

由 Jamal Hadi Salim 提交于 8月 14, 2022

Follows up on:
https://lore.kernel.org/all/20220809170518.164662-1-cascardo@canonical.com/

handle of 0 implies from/to of universe realm which is not very
sensible.

Lets see what this patch will do:
$sudo tc qdisc add dev $DEV root handle 1:0 prio

//lets manufacture a way to insert handle of 0
$sudo tc filter add dev $DEV parent 1:0 protocol ip prio 100 \
route to 0 from 0 classid 1:10 action ok

//gets rejected...
Error: handle of 0 is not valid.
We have an error talking to the kernel, -1

//lets create a legit entry..
sudo tc filter add dev $DEV parent 1:0 protocol ip prio 100 route from 10 \
classid 1:10 action ok

//what did the kernel insert?
$sudo tc filter ls dev $DEV parent 1:0
filter protocol ip pref 100 route chain 0
filter protocol ip pref 100 route chain 0 fh 0x000a8000 flowid 1:10 from 10
	action order 1: gact action pass
	 random type none pass val 0
	 index 1 ref 1 bind 1

//Lets try to replace that legit entry with a handle of 0
$ sudo tc filter replace dev $DEV parent 1:0 protocol ip prio 100 \
handle 0x000a8000 route to 0 from 0 classid 1:10 action drop

Error: Replacing with handle of 0 is invalid.
We have an error talking to the kernel, -1

And last, lets run Cascardo's POC:
$ ./poc
0
0
-22
-22
-22
Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
Acked-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02799571

net: fix potential refcount leak in ndisc_router_discovery() · 7396ba87

由 Xin Xiong 提交于 8月 13, 2022

The issue happens on specific paths in the function. After both the
object `rt` and `neigh` are grabbed successfully, when `lifetime` is
nonzero but the metric needs change, the function just deletes the
route and set `rt` to NULL. Then, it may try grabbing `rt` and `neigh`
again if above conditions hold. The function simply overwrite `neigh`
if succeeds or returns if fails, without decreasing the reference
count of previous `neigh`. This may result in memory leaks.

Fix it by decrementing the reference count of `neigh` in place.

Fixes: 6b2e04bc ("net: allow user to set metric on default route learned via Router Advertisement")
Signed-off-by: NXin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7396ba87

neighbour: make proxy_queue.qlen limit per-device · 0ff4eb3d

由 Alexander Mikhalitsyn 提交于 8月 11, 2022

Right now we have a neigh_param PROXY_QLEN which specifies maximum length
of neigh_table->proxy_queue. But in fact, this limitation doesn't work well
because check condition looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)

The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.

It seems reasonable to make proxy_queue limit per-device based.

v2:
	- nothing changed in this patch
v3:
	- rebase to net tree

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Suggested-by: NDenis V. Lunev <den@openvz.org>
Signed-off-by: NAlexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: NDenis V. Lunev <den@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ff4eb3d

neigh: fix possible DoS due to net iface start/stop loop · 66ba215c

由 Denis V. Lunev 提交于 8月 11, 2022

Normal processing of ARP request (usually this is Ethernet broadcast
packet) coming to the host is looking like the following:
* the packet comes to arp_process() call and is passed through routing
  procedure
* the request is put into the queue using pneigh_enqueue() if
  corresponding ARP record is not local (common case for container
  records on the host)
* the request is processed by timer (within 80 jiffies by default) and
  ARP reply is sent from the same arp_process() using
  NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
  pneigh_enqueue())

And here the problem comes. Linux kernel calls pneigh_queue_purge()
which destroys the whole queue of ARP requests on ANY network interface
start/stop event through __neigh_ifdown().

This is actually not a problem within the original world as network
interface start/stop was accessible to the host 'root' only, which
could do more destructive things. But the world is changed and there
are Linux containers available. Here container 'root' has an access
to this API and could be considered as untrusted user in the hosting
(container's) world.

Thus there is an attack vector to other containers on node when
container's root will endlessly start/stop interfaces. We have observed
similar situation on a real production node when docker container was
doing such activity and thus other containers on the node become not
accessible.

The patch proposed doing very simple thing. It drops only packets from
the same namespace in the pneigh_queue_purge() where network interface
state change is detected. This is enough to prevent the problem for the
whole node preserving original semantics of the code.

v2:
	- do del_timer_sync() if queue is empty after pneigh_queue_purge()
v3:
	- rebase to net tree

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Investigated-by: NAlexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: NDenis V. Lunev <den@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

66ba215c

net: qrtr: start MHI channel after endpoit creation · 68a838b8

由 Maxim Kochetkov 提交于 8月 11, 2022

MHI channel may generates event/interrupt right after enabling.
It may leads to 2 race conditions issues.

1)
Such event may be dropped by qcom_mhi_qrtr_dl_callback() at check:

	if (!qdev || mhi_res->transaction_status)
		return;

Because dev_set_drvdata(&mhi_dev->dev, qdev) may be not performed at
this moment. In this situation qrtr-ns will be unable to enumerate
services in device.
---------------------------------------------------------------

2)
Such event may come at the moment after dev_set_drvdata() and
before qrtr_endpoint_register(). In this case kernel will panic with
accessing wrong pointer at qcom_mhi_qrtr_dl_callback():

	rc = qrtr_endpoint_post(&qdev->ep, mhi_res->buf_addr,
				mhi_res->bytes_xferd);

Because endpoint is not created yet.
--------------------------------------------------------------
So move mhi_prepare_for_transfer_autoqueue after endpoint creation
to fix it.

Fixes: a2e2cc0d ("net: qrtr: Start MHI channels during init")
Signed-off-by: NMaxim Kochetkov <fido_max@inbox.ru>
Reviewed-by: NHemant Kumar <quic_hemantk@quicinc.com>
Reviewed-by: NManivannan Sadhasivam <mani@kernel.org>
Reviewed-by: NLoic Poulain <loic.poulain@linaro.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68a838b8

13 8月, 2022 1 次提交

ip6_tunnel: Fix the type of functions · 77788567

由 Hongbin Wang 提交于 8月 11, 2022

Functions ip6_tnl_change, ip6_tnl_update and ip6_tnl0_update do always
return 0, change the type of functions to void.
Signed-off-by: NHongbin Wang <wh_bin@126.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

77788567

12 8月, 2022 5 次提交

netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag · 5a2f3dc3

由 Pablo Neira Ayuso 提交于 8月 12, 2022

If the NFTA_SET_ELEM_OBJREF netlink attribute is present and
NFT_SET_OBJECT flag is set on, report EINVAL.

Move existing sanity check earlier to validate that NFT_SET_OBJECT
requires NFTA_SET_ELEM_OBJREF.

Fixes: 8aeff920 ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

5a2f3dc3

net/sunrpc: fix potential memory leaks in rpc_sysfs_xprt_state_change() · bfc48f1b

由 Xin Xiong 提交于 8月 10, 2022

The issue happens on some error handling paths. When the function
fails to grab the object `xprt`, it simply returns 0, forgetting to
decrease the reference count of another object `xps`, which is
increased by rpc_sysfs_xprt_kobj_get_xprt_switch(), causing refcount
leaks. Also, the function forgets to check whether `xps` is valid
before using it, which may result in NULL-dereferencing issues.

Fix it by adding proper error handling code when either `xprt` or
`xps` is NULL.

Fixes: 5b7eb784 ("SUNRPC: take a xprt offline using sysfs")
Signed-off-by: NXin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bfc48f1b

rds: add missing barrier to release_refill · 9f414eb4

由 Mikulas Patocka 提交于 8月 10, 2022

The functions clear_bit and set_bit do not imply a memory barrier, thus it
may be possible that the waitqueue_active function (which does not take
any locks) is moved before clear_bit and it could miss a wakeup event.

Fix this bug by adding a memory barrier after clear_bit.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9f414eb4

netfilter: nf_tables: really skip inactive sets when allocating name · 271c5ca8

由 Pablo Neira Ayuso 提交于 8月 09, 2022

While looping to build the bitmap of used anonymous set names, check the
current set in the iteration, instead of the one that is being created.

Fixes: 37a9cc52 ("netfilter: nf_tables: add generation mask to sets")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

271c5ca8

netfilter: nfnetlink: re-enable conntrack expectation events · 0b2f3212

由 Florian Westphal 提交于 8月 05, 2022

To avoid allocation of the conntrack extension area when possible,
the default behaviour was changed to only allocate the event extension
if a userspace program is subscribed to a notification group.

Problem is that while 'conntrack -E' does enable the event allocation
behind the scenes, 'conntrack -E expect' does not: no expectation events
are delivered unless user sets
"net.netfilter.nf_conntrack_events" back to 1 (always on).

Fix the autodetection to also consider EXP type group.

We need to track the 6 event groups (3+3, new/update/destroy for events and
for expectations each) independently, else we'd disable events again
if an expectation group becomes empty while there is still an active
event group.

Fixes: 2794cdb0 ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist")
Reported-by: NYi Chen <yiche@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>

0b2f3212

11 8月, 2022 13 次提交

netfilter: nf_tables: fix scheduling-while-atomic splat · 2024439b

由 Florian Westphal 提交于 8月 11, 2022

nf_tables_check_loops() can be called from rhashtable list
walk so cond_resched() cannot be used here.

Fixes: 81ea0106 ("netfilter: nf_tables: add rescheduling points during loop detection walks")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

2024439b

netfilter: nf_ct_irc: cap packet search space to 4k · 976bf59c

由 Florian Westphal 提交于 8月 09, 2022

This uses a pseudo-linearization scheme with a 64k global buffer,
but BIG TCP arrival means IPv6 TCP stack can generate skbs
that exceed this size.

In practice, IRC commands are not expected to exceed 512 bytes, plus
this is interactive protocol, so we should not see large packets
in practice.

Given most IRC connections nowadays use TLS so this helper could also be
removed in the near future.

Fixes: 7c4e983c ("net: allow gso_max_size to exceed 65536")
Fixes: 0fe79f28 ("net: allow gro_max_size to exceed 65536")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

976bf59c

netfilter: nf_ct_ftp: prefer skb_linearize · c783a29c

由 Florian Westphal 提交于 8月 09, 2022

This uses a pseudo-linearization scheme with a 64k global buffer,
but BIG TCP arrival means IPv6 TCP stack can generate skbs
that exceed this size.

Use skb_linearize.  It should be possible to rewrite this to properly
deal with segmented skbs (i.e., only do small chunk-wise accesses),
but this is going to be a lot more intrusive than this because every
helper function needs to get the sk_buff instead of a pointer to a raw
data buffer.

In practice, provided we're really looking at FTP control channel packets,
there should never be a case where we deal with huge packets.

Fixes: 7c4e983c ("net: allow gso_max_size to exceed 65536")
Fixes: 0fe79f28 ("net: allow gro_max_size to exceed 65536")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c783a29c

netfilter: nf_ct_h323: cap packet size at 64k · f3e124c3

由 Florian Westphal 提交于 8月 09, 2022

With BIG TCP, packets generated by tcp stack may exceed 64kb.
Cap datalen at 64kb.  The internal message format uses 16bit fields,
so no embedded message can exceed 64k size.

Multiple h323 messages in a single superpacket may now result
in a message to get treated as incomplete/truncated, but thats
better than scribbling past h323_buffer.

Another alternative suitable for net tree would be a switch to
skb_linearize().

Fixes: 7c4e983c ("net: allow gso_max_size to exceed 65536")
Fixes: 0fe79f28 ("net: allow gro_max_size to exceed 65536")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f3e124c3

netfilter: nf_ct_sane: remove pseudo skb linearization · a664375d

由 Florian Westphal 提交于 8月 09, 2022

For historical reason this code performs pseudo linearization of skbs
via skb_header_pointer and a global 64k buffer.

With arrival of BIG TCP, packets generated by TCP stack can exceed 64kb.

Rewrite this to only extract the needed header data.  This also allows
to get rid of the locking.

Fixes: 7c4e983c ("net: allow gso_max_size to exceed 65536")
Fixes: 0fe79f28 ("net: allow gro_max_size to exceed 65536")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

a664375d

net/tls: Use RCU API to access tls_ctx->netdev · 94ce3b64

由 Maxim Mikityanskiy 提交于 8月 10, 2022

Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.

Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.

Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.

The transition to RCU exposes existing issues, fixed by this commit:

1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.

2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.

Fixes: 89df6a81 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: NMaxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

94ce3b64

tls: rx: device: don't try to copy too much on detach · d800a7b3

由 Jakub Kicinski 提交于 8月 09, 2022

Another device offload bug, we use the length of the output
skb as an indication of how much data to copy. But that skb
is sized to offset + record length, and we start from offset.
So we end up double-counting the offset which leads to
skb_copy_bits() returning -EFAULT.
Reported-by: NTariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Tested-by: NRan Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

d800a7b3

tls: rx: device: bound the frag walk · 86b259f6

由 Jakub Kicinski 提交于 8月 09, 2022

We can't do skb_walk_frags() on the input skbs, because
the input skbs is really just a pointer to the tcp read
queue. We need to bound the "is decrypted" check by the
amount of data in the message.

Note that the walk in tls_device_reencrypt() is after a
CoW so the skb there is safe to walk. Actually in the
current implementation it can't have frags at all, but
whatever, maybe one day it will.
Reported-by: NTariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Tested-by: NRan Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

86b259f6

net_sched: cls_route: remove from list when handle is 0 · 9ad36309

由 Thadeu Lima de Souza Cascardo 提交于 8月 09, 2022

When a route filter is replaced and the old filter has a 0 handle, the old
one won't be removed from the hashtable, while it will still be freed.

The test was there since before commit 1109c005 ("net: sched: RCU
cls_route"), when a new filter was not allocated when there was an old one.
The old filter was reused and the reinserting would only be necessary if an
old filter was replaced. That was still wrong for the same case where the
old handle was 0.

Remove the old filter from the list independently from its handle value.

This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.
Reported-by: NZhenpeng Lin <zplin@u.northwestern.edu>
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: NKamal Mostafa <kamal@canonical.com>
Cc: <stable@vger.kernel.org>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220809170518.164662-1-cascardo@canonical.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

9ad36309

net: fix refcount bug in sk_psock_get (2) · 2a013372

由 Hawkins Jiawei 提交于 8月 05, 2022

Syzkaller reports refcount bug as follows:
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0
 <TASK>
 __refcount_add_not_zero include/linux/refcount.h:163 [inline]
 __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
 refcount_inc_not_zero include/linux/refcount.h:245 [inline]
 sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
 sk_backlog_rcv include/net/sock.h:1061 [inline]
 __release_sock+0x134/0x3b0 net/core/sock.c:2849
 release_sock+0x54/0x1b0 net/core/sock.c:3404
 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
 __sys_shutdown_sock net/socket.c:2331 [inline]
 __sys_shutdown_sock net/socket.c:2325 [inline]
 __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
 __do_sys_shutdown net/socket.c:2351 [inline]
 __se_sys_shutdown net/socket.c:2349 [inline]
 __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
 </TASK>

During SMC fallback process in connect syscall, kernel will
replaces TCP with SMC. In order to forward wakeup
smc socket waitqueue after fallback, kernel will sets
clcsk->sk_user_data to origin smc socket in
smc_fback_replace_callbacks().

Later, in shutdown syscall, kernel will calls
sk_psock_get(), which treats the clcsk->sk_user_data
as psock type, triggering the refcnt warning.

So, the root cause is that smc and psock, both will use
sk_user_data field. So they will mismatch this field
easily.

This patch solves it by using another bit(defined as
SK_USER_DATA_PSOCK) in PTRMASK, to mark whether
sk_user_data points to a psock object or not.
This patch depends on a PTRMASK introduced in commit f1ff5ce2
("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

For there will possibly be more flags in the sk_user_data field,
this patch also refactor sk_user_data flags code to be more generic
to improve its maintainability.

Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
Suggested-by: NJakub Kicinski <kuba@kernel.org>
Acked-by: NWen Gu <guwen@linux.alibaba.com>
Signed-off-by: NHawkins Jiawei <yin31149@gmail.com>
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

2a013372

bpf: Check the validity of max_rdwr_access for sock local storage map iterator · 52bd05eb

由 Hou Tao 提交于 8月 10, 2022

The value of sock local storage map is writable in map iterator, so check
max_rdwr_access instead of max_rdonly_access.

Fixes: 5ce6e77c ("bpf: Implement bpf iterator for sock local storage map")
Signed-off-by: NHou Tao <houtao1@huawei.com>
Acked-by: NYonghong Song <yhs@fb.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-6-houtao@huaweicloud.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

52bd05eb

bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator · f0d2b271

由 Hou Tao 提交于 8月 10, 2022

sock_map_iter_attach_target() acquires a map uref, and the uref may be
released before or in the middle of iterating map elements. For example,
the uref could be released in sock_map_iter_detach_target() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

Fixing it by acquiring an extra map uref in .init_seq_private and
releasing it in .fini_seq_private.

Fixes: 03653515 ("net: Allow iterating sockmap and sockhash")
Signed-off-by: NHou Tao <houtao1@huawei.com>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-5-houtao@huaweicloud.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

f0d2b271

bpf: Acquire map uref in .init_seq_private for sock local storage map iterator · 3c5f6e69

由 Hou Tao 提交于 8月 10, 2022

bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

So acquiring an extra map uref in bpf_iter_init_sk_storage_map() and
releasing it in bpf_iter_fini_sk_storage_map().

Fixes: 5ce6e77c ("bpf: Implement bpf iterator for sock local storage map")
Signed-off-by: NHou Tao <houtao1@huawei.com>
Acked-by: NYonghong Song <yhs@fb.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-4-houtao@huaweicloud.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

3c5f6e69

10 8月, 2022 5 次提交

netfilter: nf_tables: possible module reference underflow in error path · c485c35f

由 Pablo Neira Ayuso 提交于 8月 09, 2022

dst->ops is set on when nft_expr_clone() fails, but module refcount has
not been bumped yet, therefore nft_expr_destroy() leads to module
reference underflow.

Fixes: 8cfd9b0f ("netfilter: nftables: generalize set expressions support")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c485c35f

netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag · 4963674c

由 Pablo Neira Ayuso 提交于 8月 09, 2022

These are mutually exclusive, actually NFTA_SET_ELEM_KEY_END replaces
the flag notation.

Fixes: 7b225d0b ("netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4963674c

netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access · 34002783

由 Pablo Neira Ayuso 提交于 8月 09, 2022

The generation ID is bumped from the commit path while holding the
mutex, however, netlink dump operations rely on RCU.

This patch also adds missing cb->base_eq initialization in
nf_tables_dump_set().

Fixes: 38e029f1 ("netfilter: nf_tables: set NLM_F_DUMP_INTR if netlink dumping is stale")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

34002783

devlink: Fix use-after-free after a failed reload · 6b4db2e5

由 Ido Schimmel 提交于 8月 09, 2022

After a failed devlink reload, devlink parameters are still registered,
which means user space can set and get their values. In the case of the
mlxsw "acl_region_rehash_interval" parameter, these operations will
trigger a use-after-free [1].

Fix this by rejecting set and get operations while in the failed state.
Return the "-EOPNOTSUPP" error code which does not abort the parameters
dump, but instead causes it to skip over the problematic parameter.

Another possible fix is to perform these checks in the mlxsw parameter
callbacks, but other drivers might be affected by the same problem and I
am not aware of scenarios where these stricter checks will cause a
regression.

[1]
mlxsw_spectrum3 0000:00:10.0: Port 125: Failed to register netdev
mlxsw_spectrum3 0000:00:10.0: Failed to create ports

==================================================================
BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904
Read of size 4 at addr ffff8880099dcfd8 by task kworker/u4:4/777

CPU: 1 PID: 777 Comm: kworker/u4:4 Not tainted 5.19.0-rc7-custom-126601-gfe26f28c586d #1
Hardware name: QEMU MSN4700, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x92/0xbd lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:313 [inline]
 print_report.cold+0x5e/0x5cf mm/kasan/report.c:429
 kasan_report+0xb9/0xf0 mm/kasan/report.c:491
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report_generic.c:306
 mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904
 mlxsw_sp_acl_region_rehash_intrvl_get+0x49/0x60 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c:1106
 mlxsw_sp_params_acl_region_rehash_intrvl_get+0x33/0x80 drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3854
 devlink_param_get net/core/devlink.c:4981 [inline]
 devlink_nl_param_fill+0x238/0x12d0 net/core/devlink.c:5089
 devlink_param_notify+0xe5/0x230 net/core/devlink.c:5168
 devlink_ns_change_notify net/core/devlink.c:4417 [inline]
 devlink_ns_change_notify net/core/devlink.c:4396 [inline]
 devlink_reload+0x15f/0x700 net/core/devlink.c:4507
 devlink_pernet_pre_exit+0x112/0x1d0 net/core/devlink.c:12272
 ops_pre_exit_list net/core/net_namespace.c:152 [inline]
 cleanup_net+0x494/0xc00 net/core/net_namespace.c:582
 process_one_work+0x9fc/0x1710 kernel/workqueue.c:2289
 worker_thread+0x675/0x10b0 kernel/workqueue.c:2436
 kthread+0x30c/0x3d0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
 </TASK>

The buggy address belongs to the physical page:
page:ffffea0000267700 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x99dc
flags: 0x100000000000000(node=0|zone=1)
raw: 0100000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880099dce80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8880099dcf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff8880099dcf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
 ffff8880099dd000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8880099dd080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================

Fixes: 98bbf70c ("mlxsw: spectrum: add "acl_region_rehash_interval" devlink param")
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b4db2e5

vsock: Set socket state back to SS_UNCONNECTED in vsock_connect_timeout() · a3e7b29e

由 Peilin Ye 提交于 8月 08, 2022

Imagine two non-blocking vsock_connect() requests on the same socket.
The first request schedules @connect_work, and after it times out,
vsock_connect_timeout() sets *sock* state back to TCP_CLOSE, but keeps
*socket* state as SS_CONNECTING.

Later, the second request returns -EALREADY, meaning the socket "already
has a pending connection in progress", even though the first request has
already timed out.

As suggested by Stefano, fix it by setting *socket* state back to
SS_UNCONNECTED, so that the second request will return -ETIMEDOUT.
Suggested-by: NStefano Garzarella <sgarzare@redhat.com>
Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NPeilin Ye <peilin.ye@bytedance.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3e7b29e

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功