提交 · 1581a6c1c3291a8320b080f4411345f60229976d · openeuler / Kernel

21 6月, 2021 4 次提交

skmsg: Teach sk_psock_verdict_apply() to return errors · 1581a6c1

由 Cong Wang 提交于 6月 14, 2021

Currently sk_psock_verdict_apply() is void, but it handles some
error conditions too. Its caller is impossible to learn whether
it succeeds or fails, especially sk_psock_verdict_recv().

Make it return int to indicate error cases and propagate errors
to callers properly.

Fixes: ef565928 ("bpf, sockmap: Allow skipping sk_skb parser program")
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-7-xiyou.wangcong@gmail.com

1581a6c1

skmsg: Fix a memory leak in sk_psock_verdict_apply() · 0cf6672b

由 Cong Wang 提交于 6月 14, 2021

If the dest psock does not set SK_PSOCK_TX_ENABLED,
the skb can't be queued anywhere so must be dropped.

This one is found during code review.

Fixes: 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-6-xiyou.wangcong@gmail.com

0cf6672b

skmsg: Clear skb redirect pointer before dropping it · 30b9c54a

由 Cong Wang 提交于 6月 14, 2021

When we drop skb inside sk_psock_skb_redirect(), we have to clear
its skb->_sk_redir pointer too, otherwise kfree_skb() would
misinterpret it as a valid skb->_skb_refdst and dst_release()
would eventually complain.

Fixes: e3526bb9 ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
Reported-by: NJiang Wang <jiang.wang@bytedance.com>
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-5-xiyou.wangcong@gmail.com

30b9c54a

skmsg: Improve udp_bpf_recvmsg() accuracy · 9f2470fb

由 Cong Wang 提交于 6月 14, 2021

I tried to reuse sk_msg_wait_data() for different protocols,
but it turns out it can not be simply reused. For example,
UDP actually uses two queues to receive skb:
udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
to check both of them to know whether we have received any
packet.

Also, UDP does not lock the sock during BH Rx path, it makes
no sense for its ->recvmsg() to lock the sock. It is always
possible for ->recvmsg() to be called before packets actually
arrive in the receive queue, we just use best effort to make
it accurate here.

Fixes: 1f5be6b3 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com

9f2470fb

16 6月, 2021 1 次提交

net: inline function get_net_ns_by_fd if NET_NS is disabled · e34492de

由 Changbin Du 提交于 6月 15, 2021

The function get_net_ns_by_fd() could be inlined when NET_NS is not
enabled.
Signed-off-by: NChangbin Du <changbin.du@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e34492de

13 6月, 2021 1 次提交

net: make get_net_ns return error if NET_NS is disabled · ea6932d7

由 Changbin Du 提交于 6月 11, 2021

There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
The reason is that nsfs tries to access ns->ops but the proc_ns_operations
is not implemented in this case.

[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[7.670268] pgd = 32b54000
[7.670544] [00000010] *pgd=00000000
[7.671861] Internal error: Oops: 5 [#1] SMP ARM
[7.672315] Modules linked in:
[7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2 #16
[7.673309] Hardware name: Generic DT based system
[7.673642] PC is at nsfs_evict+0x24/0x30
[7.674486] LR is at clear_inode+0x20/0x9c

The same to tun SIOCGSKNS command.

To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
disabled. Meanwhile move it to right place net/core/net_namespace.c.
Signed-off-by: NChangbin Du <changbin.du@gmail.com>
Fixes: c62cce2c ("net: add an ioctl to get a socket network namespace")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: NJakub Kicinski <kuba@kernel.org>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea6932d7

11 6月, 2021 1 次提交

skbuff: fix incorrect msg_zerocopy copy notifications · 3bdd5ee0

由 Willem de Bruijn 提交于 6月 09, 2021

msg_zerocopy signals if a send operation required copying with a flag
in serr->ee.ee_code.

This field can be incorrect as of the below commit, as a result of
both structs uarg and serr pointing into the same skb->cb[].

uarg->zerocopy must be read before skb->cb[] is reinitialized to hold
serr. Similar to other fields len, hi and lo, use a local variable to
temporarily hold the value.

This was not a problem before, when the value was passed as a function
argument.

Fixes: 75518851 ("skbuff: Push status and refcounts into sock_zerocopy_callback")
Reported-by: NTalal Ahmad <talalahmad@google.com>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3bdd5ee0

10 6月, 2021 1 次提交

rtnetlink: Fix regression in bridge VLAN configuration · d2e381c4

由 Ido Schimmel 提交于 6月 09, 2021

Cited commit started returning errors when notification info is not
filled by the bridge driver, resulting in the following regression:

 # ip link add name br1 type bridge vlan_filtering 1
 # bridge vlan add dev br1 vid 555 self pvid untagged
 RTNETLINK answers: Invalid argument

As long as the bridge driver does not fill notification info for the
bridge device itself, an empty notification should not be considered as
an error. This is explained in commit 59ccaaaa ("bridge: dont send
notification when skb->len == 0 in rtnl_bridge_notify").

Fix by removing the error and add a comment to avoid future bugs.

Fixes: a8db57c1 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()")
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NNikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2e381c4

08 6月, 2021 1 次提交

neighbour: allow NUD_NOARP entries to be forced GCed · 7a6b1ab7

由 David Ahern 提交于 6月 07, 2021

IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to
fill up the neighbour table with enough entries that it will overflow for
valid connections after that.

This behaviour is more prevalent after commit 58956317 ("neighbor:
Improve garbage collection") is applied, as it prevents removal from
entries that are not NUD_FAILED, unless they are more than 5s old.

Fixes: 58956317 (neighbor: Improve garbage collection)
Reported-by: NKasper Dupont <kasperd@gjkwv.06.feb.2021.kasperd.net>
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a6b1ab7

04 6月, 2021 2 次提交

fib: Return the correct errno code · 59607863

由 Zheng Yongjun 提交于 6月 02, 2021

When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.
Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

59607863

rtnetlink: Fix missing error code in rtnl_bridge_notify() · a8db57c1

由 Jiapeng Chong 提交于 6月 02, 2021

The error code is missing in this code scenario, add the error code
'-EINVAL' to the return value 'err'.

Eliminate the follow smatch warning:

net/core/rtnetlink.c:4834 rtnl_bridge_notify() warn: missing error code
'err'.
Reported-by: NAbaci Robot <abaci@linux.alibaba.com>
Signed-off-by: NJiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a8db57c1

02 6月, 2021 1 次提交

net: sock: fix in-kernel mark setting · dd9082f4

由 Alexander Aring 提交于 5月 31, 2021

This patch fixes the in-kernel mark setting by doing an additional
sk_dst_reset() which was introduced by commit 50254256 ("sock: Reset
dst when changing sk_mark via setsockopt"). The code is now shared to
avoid any further suprises when changing the socket mark value.

Fixes: 84d1c617 ("net: sock: add sock_set_mark")
Reported-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NAlexander Aring <aahringo@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dd9082f4

28 5月, 2021 1 次提交

devlink: Correct VIRTUAL port to not have phys_port attributes · b28d8f0c

由 Parav Pandit 提交于 5月 26, 2021

Physical port name, port number attributes do not belong to virtual port
flavour. When VF or SF virtual ports are registered they incorrectly
append "np0" string in the netdevice name of the VF/SF.

Before this fix, VF netdevice name were ens2f0np0v0, ens2f0np0v1 for VF
0 and 1 respectively.

After the fix, they are ens2f0v0, ens2f0v1.

With this fix, reading /sys/class/net/ens2f0v0/phys_port_name returns
-EOPNOTSUPP.

Also devlink port show example for 2 VFs on one PF to ensure that any
physical port attributes are not exposed.

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
pci/0000:06:00.3/196608: type eth netdev ens2f0v0 flavour virtual splittable false
pci/0000:06:00.4/262144: type eth netdev ens2f0v1 flavour virtual splittable false

This change introduces a netdevice name change on systemd/udev
version 245 and higher which honors phys_port_name sysfs file for
generation of netdevice name.

This also aligns to phys_port_name usage which is limited to switchdev
ports as described in [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/networking/switchdev.rst

Fixes: acf1ee44 ("devlink: Introduce devlink port flavour virtual")
Signed-off-by: NParav Pandit <parav@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20210526200027.14008-1-parav@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b28d8f0c

21 5月, 2021 1 次提交

bpf: Set mac_len in bpf_skb_change_head · 84316ca4

由 Jussi Maki 提交于 5月 19, 2021

The skb_change_head() helper did not set "skb->mac_len", which is
problematic when it's used in combination with skb_redirect_peer().
Without it, redirecting a packet from a L3 device such as wireguard to
the veth peer device will cause skb->data to point to the middle of the
IP header on entry to tcp_v4_rcv() since the L2 header is not pulled
correctly due to mac_len=0.

Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
Signed-off-by: NJussi Maki <joamaki@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210519154743.2554771-2-joamaki@gmail.com

84316ca4

15 5月, 2021 3 次提交

mm: fix struct page layout on 32-bit systems · 9ddb3c14

由 Matthew Wilcox (Oracle) 提交于 5月 14, 2021

32-bit architectures which expect 8-byte alignment for 8-byte integers and
need 64-bit DMA addresses (arm, mips, ppc) had their struct page
inadvertently expanded in 2019.  When the dma_addr_t was added, it forced
the alignment of the union to 8 bytes, which inserted a 4 byte gap between
'flags' and the union.

Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long.  We always
store the low bits in the first word to prevent the PageTail bit from
being inadvertently set on a big endian platform.  If that happened,
get_user_pages_fast() racing against a page which was freed and
reallocated to the page_pool could dereference a bogus compound_head(),
which would be hard to trace back to this cause.

Link: https://lkml.kernel.org/r/20210510153211.1504886-1-willy@infradead.org
Fixes: c25fff71 ("mm: add dma_addr_t to struct page")
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Tested-by: NMatteo Croce <mcroce@linux.microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9ddb3c14

net: sched: fix tx action reschedule issue with stopped queue · dcad9ee9

由 Yunsheng Lin 提交于 5月 14, 2021

The netdev qeueue might be stopped when byte queue limit has
reached or tx hw ring is full, net_tx_action() may still be
rescheduled if STATE_MISSED is set, which consumes unnecessary
cpu without dequeuing and transmiting any skb because the
netdev queue is stopped, see qdisc_run_end().

This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if netdev queue is
stopped during qdisc_run(), the net_tx_action() is rescheduled
again when netdev qeueue is restarted, see netif_tx_wake_queue().

As there is time window between netif_xmit_frozen_or_stopped()
checking and STATE_MISSED clearing, between which STATE_MISSED
may set by net_tx_action() scheduled by netif_tx_wake_queue(),
so set the STATE_MISSED again if netdev queue is restarted.

Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
Reported-by: NMichal Kubecek <mkubecek@suse.cz>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dcad9ee9

net: sched: fix tx action rescheduling issue during deactivation · 102b55ee

由 Yunsheng Lin 提交于 5月 14, 2021

Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
qdisc before calling __qdisc_run(), which ultimately clear the
STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED
is set before clearing STATE_MISSED, there may be rescheduling
of net_tx_action() at the end of qdisc_run_end(), see below:

CPU0(net_tx_atcion)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
          .                   .                     .
          .            set STATE_MISSED             .
          .           __netif_schedule()            .
          .                   .           set STATE_DEACTIVATED
          .                   .                qdisc_reset()
          .                   .                     .
          .<---------------   .              synchronize_net()
clear __QDISC_STATE_SCHED  |  .                     .
          .                |  .                     .
          .                |  .            some_qdisc_is_busy()
          .                |  .               return *false*
          .                |  .                     .
  test STATE_DEACTIVATED   |  .                     .
__qdisc_run() *not* called |  .                     .
          .                |  .                     .
   test STATE_MISS         |  .                     .
 __netif_schedule()--------|  .                     .
          .                   .                     .
          .                   .                     .

__qdisc_run() is not called by net_tx_atcion() in CPU0 because
CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and
STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule
is called at the end of qdisc_run_end(), causing tx action
rescheduling problem.

qdisc_run() called by net_tx_action() runs in the softirq context,
which should has the same semantic as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a
synchronize_net() between STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely
bail out for the deactived lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skb not dequeued yet.

So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().

The STATE_DEACTIVATED checking in qdisc_run() is to avoid race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for
deactived lockless qdisc in net_tx_action() provides better
protection for the race without calling qdisc_run() at all, so
remove the STATE_DEACTIVATED checking in qdisc_run().

After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() has
    avoid the endless rescheduling problem, but there may still
    be a unnecessary rescheduling, so adjust the commit log.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

102b55ee

13 5月, 2021 1 次提交

net: really orphan skbs tied to closing sk · 098116e7

由 Paolo Abeni 提交于 5月 11, 2021

If the owing socket is shutting down - e.g. the sock reference
count already dropped to 0 and only sk_wmem_alloc is keeping
the sock alive, skb_orphan_partial() becomes a no-op.

When forwarding packets over veth with GRO enabled, the above
causes refcount errors.

This change addresses the issue with a plain skb_orphan() call
in the critical scenario.

Fixes: 9adc89af ("net: let skb_orphan_partial wake-up waiters.")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

098116e7

01 5月, 2021 2 次提交

net: page_pool: use alloc_pages_bulk in refill code path · be5dba25

由 Jesper Dangaard Brouer 提交于 4月 29, 2021

There are cases where the page_pool need to refill with pages from the
page allocator.  Some workloads cause the page_pool to release pages
instead of recycling these pages.

For these workload it can improve performance to bulk alloc pages from the
page-allocator to refill the alloc cache.

For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to create an
SKB from the xdp_frame, which then cannot return the page to the
page_pool.

Performance results under GitHub xdp-project[1]:
 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array
variant" was squashed with this patch. From the test page, the array
variant was superior with one of the test results as follows.

	Kernel		XDP stats       CPU     pps           Delta
	Baseline	XDP-RX CPU      total   3,771,046       n/a
	List		XDP-RX CPU      total   3,940,242    +4.49%
	Array		XDP-RX CPU      total   4,249,224   +12.68%

Link: https://lkml.kernel.org/r/20210325114228.27719-10-mgorman@techsingularity.netSigned-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Reviewed-by: NAlexander Lobakin <alobakin@pm.me>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be5dba25

net: page_pool: refactor dma_map into own function page_pool_dma_map · dfa59717

由 Jesper Dangaard Brouer 提交于 4月 29, 2021

In preparation for next patch, move the dma mapping into its own function,
as this will make it easier to follow the changes.

[ilias.apalodimas: make page_pool_dma_map return boolean]

Link: https://lkml.kernel.org/r/20210325114228.27719-9-mgorman@techsingularity.netSigned-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Reviewed-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: NAlexander Lobakin <alobakin@pm.me>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dfa59717

29 4月, 2021 1 次提交

net: selftest: fix build issue if INET is disabled · 4a52dd8f

由 Oleksij Rempel 提交于 4月 28, 2021

In case ethernet driver is enabled and INET is disabled, selftest will
fail to build.
Reported-by: NRandy Dunlap <rdunlap@infradead.org>
Fixes: 3e1e58d6 ("net: add generic selftest support")
Signed-off-by: NOleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210428130947.29649-1-o.rempel@pengutronix.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>

4a52dd8f

24 4月, 2021 2 次提交

devlink: Extend SF port attributes to have external attribute · a1ab3e45

由 Parav Pandit 提交于 3月 10, 2021

Extended SF port attributes to have optional external flag similar to
PCI PF and VF port attributes.

External atttibute is required to generate unique phys_port_name when PF number
and SF number are overlapping between two controllers similar to SR-IOV
VFs.

When a SF is for external controller an example view of external SF
port and config sequence.

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77
Signed-off-by: NParav Pandit <parav@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Reviewed-by: NVu Pham <vuhuong@nvidia.com>
Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>

a1ab3e45

net: sock: remove the unnecessary check in proto_register · ed744d81

由 Tonghao Zhang 提交于 4月 22, 2021

tw_prot_cleanup will check the twsk_prot.

Fixes: 0f5907af ("net: Fix potential memory leak in proto_register()")
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ed744d81

23 4月, 2021 1 次提交

net, xdp: Update pkt_type if generic XDP changes unicast MAC · 22b60343

由 Martin Willi 提交于 4月 19, 2021

If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.

The use case for this is when a XDP program wants to push received
packets up the stack by rewriting the MAC to the NICs MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 29724956 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: NMartin Willi <martin@strongswan.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NToke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org

22b60343

22 4月, 2021 1 次提交

neighbour: Prevent Race condition in neighbour subsytem · eefb45ee

由 Chinmay Agarwal 提交于 4月 22, 2021

Following Race Condition was detected:

<CPU A, t0>: Executing: __netif_receive_skb() ->__netif_receive_skb_core()
-> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which
takes a reference on neighbour entry 'n'.
Moves further along, arp_process() and calls neigh_update()->
__neigh_update(). Neighbour entry is unlocked just before a call to
neigh_update_gc_list.

This unlocking paves way for another thread that may take a reference on
the same and mark it dead and remove it from gc_list.

<CPU B, t1> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead. Also n will be
removed from gc_list.
Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t1,
'n' couldn't be destroyed.

<CPU A, t3>- Code hits neigh_update_gc_list, with neighbour entry
set as dead.

<CPU A, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry and we have a destroyed ntry still part of gc_list.

Fixes: eb4e8fac("neighbour: Prevent a dead entry from updating gc_list")
Signed-off-by: NChinmay Agarwal <chinagar@codeaurora.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eefb45ee

21 4月, 2021 1 次提交

net: add generic selftest support · 3e1e58d6

由 Oleksij Rempel 提交于 4月 19, 2021

Port some parts of the stmmac selftest and reuse it as basic generic selftest
library. This patch was tested with following combinations:
- iMX6DL FEC -> AT8035
- iMX6DL FEC -> SJA1105Q switch -> KSZ8081
- iMX6DL FEC -> SJA1105Q switch -> KSZ9031
- AR9331 ag71xx -> AR9331 PHY
- AR9331 ag71xx -> AR9331 switch -> AR9331 PHY
Signed-off-by: NOleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e1e58d6

20 4月, 2021 1 次提交

gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check · 7ad18ff6

由 Alexander Lobakin 提交于 4月 19, 2021

Commit 38ec4944 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that napi_gro_frags() logics
calls for skb_gro_reset_offset() *before* pulling Ethernet header
to the skb linear space.
That said, the introduced check for frag0 address being aligned to 4
always fails for it as Ethernet header is obviously 14 bytes long,
and in case with NET_IP_ALIGN its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized
   more efficively (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me

Fixes: 38ec4944 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NAlexander Lobakin <alobakin@pm.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ad18ff6

17 4月, 2021 2 次提交

flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target() · 1e3d976d

由 Gustavo A. R. Silva 提交于 4月 16, 2021

Fix the following out-of-bounds warning:

net/core/flow_dissector.c:835:3: warning: 'memcpy' offset [33, 48] from the object at 'flow_keys' is out of the bounds of referenced subobject 'ipv6_src' with type '__u32[4]' {aka 'unsigned int[4]'} at offset 16 [-Warray-bounds]

The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy(). So, the compiler legitimately complains about it. As these
are just a couple of members, fix this by copying each one of them in
separate calls to memcpy().

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e3d976d

scm: fix a typo in put_cmsg() · e7ad33fa

由 Eric Dumazet 提交于 4月 16, 2021

We need to store cmlen instead of len in cm->cmsg_len.

Fixes: 38ebcf50 ("scm: optimize put_cmsg()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7ad33fa

16 4月, 2021 1 次提交

scm: optimize put_cmsg() · 38ebcf50

由 Eric Dumazet 提交于 4月 15, 2021

Calling two copy_to_user() for very small regions has very high overhead.

Switch to inlined unsafe_put_user() to save one stac/clac sequence,
and avoid copy_to_user().
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

38ebcf50

15 4月, 2021 1 次提交

skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()" · 17c3df70

由 Paolo Abeni 提交于 4月 14, 2021

the commit 1ddc3229 ("skbuff: remove some unnecessary operation
in skb_segment_list()") introduces an issue very similar to the
one already fixed by commit 53475c5d ("net: fix use-after-free when
UDP GRO with shared fraglist").

If the GSO skb goes though skb_clone() and pskb_expand_head() before
entering skb_segment_list(), the latter  will unshare the frag_list
skbs and will release the old list. With the reverted commit in place,
when skb_segment_list() completes, skb->next points to the just
released list, and later on the kernel will hit UaF.

Note that since commit e0e3070a ("udp: properly complete L4 GRO
over UDP tunnel packet") the critical scenario can be reproduced also
receiving UDP over vxlan traffic with:

NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink

Attaching a packet socket to the NIC will cause skb_clone() and the
tunnel decapsulation will call pskb_expand_head().

Fixes: 1ddc3229 ("skbuff: remove some unnecessary operation in skb_segment_list()")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

17c3df70

14 4月, 2021 1 次提交

gro: ensure frag0 meets IP header alignment · 38ec4944

由 Eric Dumazet 提交于 4月 13, 2021

After commit 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
Guenter Roeck reported one failure in his tests using sh architecture.

After much debugging, we have been able to spot silent unaligned accesses
in inet_gro_receive()

The issue at hand is that upper networking stacks assume their header
is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
bytes before the Ethernet header to make that happen.

This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
if the fragment is not properly aligned.

Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN
as 0, this extra check will be a NOP for them.

Note that if frag0 is not used, GRO will call pskb_may_pull()
as many times as needed to pull network and transport headers.

Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NGuenter Roeck <linux@roeck-us.net>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Tested-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

38ec4944

12 4月, 2021 2 次提交

sock_map: Fix a potential use-after-free in sock_map_close() · aadb2bb8

由 Cong Wang 提交于 4月 07, 2021

The last refcnt of the psock can be gone right after
sock_map_remove_links(), so sk_psock_stop() could trigger a UAF.
The reason why I placed sk_psock_stop() there is to avoid RCU read
critical section, and more importantly, some callee of
sock_map_remove_links() is supposed to be called with RCU read lock,
we can not simply get rid of RCU read lock here. Therefore, the only
choice we have is to grab an additional refcnt with sk_psock_get()
and put it back after sk_psock_stop().

Fixes: 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Reported-by: syzbot+7b6548ae483d6f4c64ae@syzkaller.appspotmail.com
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210408030556.45134-1-xiyou.wangcong@gmail.com

aadb2bb8

skmsg: Pass psock pointer to ->psock_update_sk_prot() · 51e0158a

由 Cong Wang 提交于 4月 06, 2021

Using sk_psock() to retrieve psock pointer from sock requires
RCU read lock, but we already get psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it
without bothering sk_psock().

Fixes: 8a59f9d1 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com

51e0158a

10 4月, 2021 1 次提交

net: fix hangup on napi_disable for threaded napi · 27f0ad71

由 Paolo Abeni 提交于 4月 09, 2021

napi_disable() is subject to an hangup, when the threaded
mode is enabled and the napi is under heavy traffic.

If the relevant napi has been scheduled and the napi_disable()
kicks in before the next napi_threaded_wait() completes - so
that the latter quits due to the napi_disable_pending() condition,
the existing code leaves the NAPI_STATE_SCHED bit set and the
napi_disable() loop waiting for such bit will hang.

This patch addresses the issue by dropping the NAPI_STATE_DISABLE
bit test in napi_thread_wait(). The later napi_threaded_poll()
iteration will take care of clearing the NAPI_STATE_SCHED.

This also addresses a related problem reported by Jakub:
before this patch a napi_disable()/napi_enable() pair killed
the napi thread, effectively disabling the threaded mode.
On the patched kernel napi_disable() simply stops scheduling
the relevant thread.

v1 -> v2:
  - let the main napi_thread_poll() loop clear the SCHED bit
Reported-by: NJakub Kicinski <kuba@kernel.org>
Fixes: 29863d41 ("net: implement threaded-able napi poll loop support")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

27f0ad71

09 4月, 2021 1 次提交

ipv6: report errors for iftoken via netlink extack · 3583a4e8

由 Stephen Hemminger 提交于 4月 07, 2021

Setting iftoken can fail for several different reasons but there
and there was no report to user as to the cause. Add netlink
extended errors to the processing of the request.

This requires adding additional argument through rtnl_af_ops
set_link_af callback.
Reported-by: NHongren Zheng <li@zenithal.me>
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3583a4e8

08 4月, 2021 2 次提交

net: remove the new_ifindex argument from dev_change_net_namespace · 0854fa82

由 Andrei Vagin 提交于 4月 06, 2021

Here is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.

Fixes: eeb85a14 ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NAndrei Vagin <avagin@gmail.com>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0854fa82

net: introduce nla_policy for IFLA_NEW_IFINDEX · 7e4a5131

由 Andrei Vagin 提交于 4月 06, 2021

In this case, we don't need to check that new_ifindex is positive in
validate_linkmsg.

Fixes: eeb85a14 ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NAndrei Vagin <avagin@gmail.com>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7e4a5131

07 4月, 2021 1 次提交

bpf, sockmap: Fix incorrect fwd_alloc accounting · 144748eb

由 John Fastabend 提交于 4月 01, 2021

Incorrect accounting fwd_alloc can result in a warning when the socket
is torn down,

 [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
 [...]
 [18455.319543] Call Trace:
 [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
 [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
 [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
 [18455.319631]  ? sk_reset_timer+0x15/0x70
 [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
 [18455.319663]  ? lock_release+0xb2/0x3f0
 [18455.319676]  ? __release_sock+0x8a/0x1b0
 [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319704]  ? lock_release+0x3f0/0x3f0
 [18455.319717]  ? __tcp_close+0x2c6/0x790
 [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
 [18455.319750]  tcp_v4_do_rcv+0x168/0x370
 [18455.319767]  __release_sock+0xbc/0x1b0
 [18455.319785]  __tcp_close+0x2ee/0x790
 [18455.319805]  tcp_close+0x20/0x80

This currently happens because on redirect case we do skb_set_owner_r()
with the original sock. This increments the fwd_alloc memory accounting
on the original sock. Then on redirect we may push this into the queue
of the psock we are redirecting to. When the skb is flushed from the
queue we give the memory back to the original sock. The problem is if
the original sock is destroyed/closed with skbs on another psocks queue
then the original sock will not have a way to reclaim the memory before
being destroyed. Then above warning will be thrown

  sockA                          sockB

  sk_psock_strp_read()
   sk_psock_verdict_apply()
     -- SK_REDIRECT --
     sk_psock_skb_redirect()
                                skb_queue_tail(psock_other->ingress_skb..)

  sk_close()
   sock_map_unref()
     sk_psock_put()
       sk_psock_drop()
         sk_psock_zap_ingress()

At this point we have torn down our own psock, but have the outstanding
skb in psock_other. Note that SK_PASS doesn't have this problem because
the sk_psock_drop() logic releases the skb, its still associated with
our psock.

To resolve lets only account for sockets on the ingress queue that are
still associated with the current socket. On the redirect case we will
check memory limits per 6fa9201a, but will omit fwd_alloc accounting
until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
or received with sk_psock_skb_ingress memory will be claimed on psock_other.

Fixes: 6fa9201a ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
Reported-by: NAndrii Nakryiko <andrii@kernel.org>
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370

144748eb

06 4月, 2021 1 次提交

net: Allow to specify ifindex when device is moved to another namespace · eeb85a14

由 Andrei Vagin 提交于 4月 05, 2021

Currently, we can specify ifindex on link creation. This change allows
to specify ifindex when a device is moved to another network namespace.

Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.

CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they have before C/R.

Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NAndrei Vagin <avagin@gmail.com>
Reviewed-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eeb85a14

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功