Commits · 5cf92bbadc585e1bcb710df75293e07b7c846bb6 · openeuler / Kernel

23 Jan, 2021 3 commits

mptcp: re-enable sndbuf autotune · 5cf92bba

Authored by Paolo Abeni on Jan 20, 2021

After commit 6e628cd3 ("mptcp: use mptcp release_cb for
delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE
on its subflow. As a side effect, autotune never takes place,
as it happens inside tcp_new_space(), which in turn is called
only when the mentioned bit is set.

Let's sendmsg() set the subflows NOSPACE bit when looking for
more memory and use the subflow write_space callback to propagate
the snd buf update and wake-up the user-space.

Additionally, this allows dropping a bunch of duplicate code and
makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows.

Fixes: 6e628cd3 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5cf92bba

mptcp: always graft subflow socket to parent · 866f26f2

Authored by Paolo Abeni on Jan 20, 2021

Currently, incoming subflows link to the parent socket,
while outgoing ones link to a per subflow socket. The latter
is not really needed, except at the initial connect() time and
for the first subflow.

Always graft the outgoing subflow to the parent socket and
free the unneeded ones early.

This allows some code cleanup, reduces the amount of memory
used and will simplify the next patch.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

866f26f2

tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS · e7ed11ee

Authored by Yousuk Seung on Jan 20, 2021

This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e7ed11ee

21 Jan, 2021 11 commits

ip_gre: remove CRC flag from dev features in gre_gso_segment · 1a236766

Authored by Xin Long on Jan 16, 2021

This patch is to let it always do CRC checksum in sctp_gso_segment()
by removing CRC flag from the dev features in gre_gso_segment() for
SCTP over GRE, just as it does in Commit 527beb8e ("udp: support
sctp over udp in skb_udp_tunnel_segment") for SCTP over UDP.

It could set csum/csum_start in GSO CB properly in sctp_gso_segment()
after that commit, so it would do checksum with gso_make_checksum()
in gre_gso_segment(), and Commit 622e32b7 ("net: gre: recompute
gre csum for sctp over gre tunnels") can be reverted now.

Note that when need_csum is false, we can still leave CRC checksum
of SCTP to HW by not clearing this CRC flag if it's supported, as
Jakub and Alex noticed.

v1->v2:
  - improve the changelog.
  - fix "rev xmas tree" in variables declaration.
v2->v3:
  - remove CRC flag from dev features only when need_csum is true.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/00439f24d5f69e2c6fa2beadc681d056c15c258f.1610772251.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1a236766

udp: not remove the CRC flag from dev features when need_csum is false · 4eb5d4a5

Authored by Xin Long on Jan 16, 2021

In __skb_udp_tunnel_segment(), when it's a SCTP over VxLAN/GENEVE
packet and need_csum is false, which means the outer udp checksum
doesn't need to be computed, csum_start and csum_offset could be
used by the inner SCTP CRC CSUM for SCTP HW CRC offload.

So this patch is to not remove the CRC flag from dev features when
need_csum is false.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1e81b700642498546eaa3f298e023fd7ad394f85.1610776757.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4eb5d4a5

net/sched: cls_flower add CT_FLAGS_INVALID flag support · 7baf2429

Authored by wenxu on Jan 19, 2021

This patch adds the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the invalid ct_state for conntrack.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7baf2429

net: inline rollback_registered_many() · 0cbe1e57

Authored by Jakub Kicinski on Jan 19, 2021

Similar to the change for rollback_registered() -
rollback_registered_many() was a part of unregister_netdevice_many()
minus the net_set_todo(), which is no longer needed.

Functionally this patch moves the list_empty() check back after:

	BUG_ON(dev_boot_phase);
	ASSERT_RTNL();

but I can't find any reason why that would be an issue.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0cbe1e57

net: move rollback_registered_many() · bcfe2f1a

Authored by Jakub Kicinski on Jan 19, 2021

Move rollback_registered_many() and add a temporary
forward declaration to make merging the code into
unregister_netdevice_many() easier to review.

No functional changes.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bcfe2f1a

net: inline rollback_registered() · 037e56bd

Authored by Jakub Kicinski on Jan 19, 2021

rollback_registered() is a local helper; it's common for driver
code to call unregister_netdevice_queue(dev, NULL) when it
wants to unregister netdevices under rtnl_lock. Inline
rollback_registered() and adjust the only remaining caller.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

037e56bd

net: move net_set_todo inside rollback_registered() · 2014beea

Authored by Jakub Kicinski on Jan 19, 2021

Commit 93ee31f1 ("[NET]: Fix free_netdev on register_netdev
failure.") moved net_set_todo() outside of rollback_registered()
so that rollback_registered() can be used in the failure path of
register_netdevice() but without risking a double free.

Since commit cf124db5 ("net: Fix inconsistent teardown and
release of private netdev state."), however, we have a better
way of handling that condition, since destructors don't call
free_netdev() directly.

After the change in commit c269a24c ("net: make free_netdev()
more lenient with unregistering devices") we can now move
net_set_todo() back.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2014beea

nexthop: Specialize rtm_nh_policy · 643d0878

Authored by Petr Machata on Jan 20, 2021

This policy is currently only used for creation of new next hops and new
next hop groups. Rename it accordingly and remove the two attributes that
are not valid in that context: NHA_GROUPS and NHA_MASTER.

For consistency with other policies, do not mention policy array size in
the declarator, and replace NHA_MAX for ARRAY_SIZE as appropriate.

Note that with this commit, NHA_MAX and __NHA_MAX are not used anymore.
Leave them in purely as a user API.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

643d0878

nexthop: Use a dedicated policy for nh_valid_dump_req() · 44551bff

Authored by Petr Machata on Jan 20, 2021

This function uses the global nexthop policy, but only accepts four
particular attributes. Create a new policy that only includes the four
supported attributes, and use it. Convert the loop to a series of ifs.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

44551bff

nexthop: Use a dedicated policy for nh_valid_get_del_req() · 60f5ad5e

Authored by Petr Machata on Jan 20, 2021

This function uses the global nexthop policy only to then bounce all
arguments except for NHA_ID. Instead, just create a new policy that
only includes the one allowed attribute.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

60f5ad5e

tcp: Fix potential use-after-free due to double kfree() · c89dffc7

Authored by Kuniyuki Iwashima on Jan 18, 2021

On receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates
struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After
that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets ireq_opt to NULL. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full
socket.

The commit 01770a16 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one core
creates a full socket for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, so that
both sock_put() and reqsk_put() call kfree() on the same memory:

  sock_put
    sk_free
      __sk_free
        sk_destruct
          __sk_destruct
            sk->sk_destruct/inet_sock_destruct
              kfree(rcu_dereference_protected(inet->inet_opt, 1));

  reqsk_put
    reqsk_free
      __reqsk_free
        req->rsk_ops->destructor/tcp_v4_reqsk_destructor
          kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));

Calling kmalloc() between the two kfree() calls can lead to use-after-free,
so this patch fixes it by setting inet_opt to NULL before sock_put().
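
The ownership rule behind this fix can be sketched in plain userspace C. This is a simplified illustration, not the kernel code: `struct req_stub`, `struct child_stub` and the two destructors are hypothetical stand-ins for the request socket, the full socket and their destructor callbacks, and a counter records how often the shared options buffer is released.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the request socket and the full socket. */
struct req_stub   { char *opt; };
struct child_stub { char *opt; };

static int free_calls; /* how many times the options buffer was released */

static void release_opt(char **opt)
{
	if (*opt) {
		free_calls++;
		free(*opt);
		*opt = NULL;
	}
}

/* Each destructor frees whatever options pointer its object still holds. */
static void destroy_child(struct child_stub *c) { release_opt(&c->opt); }
static void destroy_req(struct req_stub *r)     { release_opt(&r->opt); }

/*
 * Path taken by the core that loses the race: it drops its child socket
 * and later the request. Returns the number of frees, or -1 if the
 * allocation failed.
 */
static int loser_cleanup(void)
{
	struct req_stub r = { .opt = malloc(16) };
	struct child_stub c = { .opt = r.opt }; /* child copied the pointer */

	if (!r.opt)
		return -1;
	free_calls = 0;
	c.opt = NULL;        /* the fix: clear the copy before dropping the child */
	destroy_child(&c);   /* nothing left to free here */
	destroy_req(&r);     /* the single remaining owner frees the buffer */
	return free_calls;
}
```

Without the `c.opt = NULL` line both destructors would see the same non-NULL pointer and the buffer would be freed twice.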

As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.

Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c89dffc7

20 Jan, 2021 11 commits

tcp: fix TCP socket rehash stats mis-accounting · 9c30ae83

Authored by Yuchung Cheng on Jan 19, 2021

The previous commit 32efcc06 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:

  a. During handshake of an active open, only counts the first
     SYN timeout

  b. After handshake of passive and active open, stop updating
     after (roughly) TCP_RETRIES1 recurring RTOs

  c. After the socket aborts, over count timeout_rehash by 1

This patch fixes this by checking the rehash result from sk_rethink_txhash.
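
The accounting idea, bumping the counter only when a rehash really took place, can be sketched as follows. This is an assumption-laden userspace model: `struct flow_stub`, `rethink_txhash()` and `on_timeout()` are hypothetical names, and the "hash of zero means no hash in use" rule is a simplification chosen for the example.

```c
#include <stdbool.h>

/* Hypothetical flow state: a transmit hash and a rehash counter. */
struct flow_stub {
	unsigned int txhash;
	unsigned int timeout_rehash;
};

/*
 * Pick a new transmit hash. Returns true only if a rehash actually
 * happened; a flow without a hash (txhash == 0) is left untouched.
 */
static bool rethink_txhash(struct flow_stub *f, unsigned int new_hash)
{
	if (f->txhash == 0)
		return false;
	f->txhash = new_hash;
	return true;
}

/* Timeout handler: count the rehash based on the helper's result. */
static void on_timeout(struct flow_stub *f, unsigned int new_hash)
{
	if (rethink_txhash(f, new_hash))
		f->timeout_rehash++;
}
```

Tying the counter to the helper's return value keeps the statistic and the real behaviour in step, whichever caller triggers the timeout.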

Fixes: 32efcc06 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9c30ae83

net: fix GSO for SG-enabled devices · 00b229f7

Authored by Paolo Abeni on Jan 19, 2021

The commit dbd50f23 ("net: move the hsize check to the else
block in skb_segment") introduced a data corruption for devices
supporting scatter-gather.

The problem boils down to a signed/unsigned comparison giving
unexpected results: if the signed 'hsize' is negative, it will be
considered greater than a positive 'len', which is unsigned.

This commit addresses the issue by reverting to the old order of checks,
so that 'hsize' never has a negative value when compared with 'len'.
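
The pitfall is easy to reproduce outside the kernel. The sketch below is a standalone illustration (the function names are hypothetical and the types are assumed to be `int` and `unsigned int`): in the naive version C's usual arithmetic conversions turn a negative `hsize` into a huge unsigned value, while the ordered version rules out negative values before comparing.

```c
#include <stdbool.h>

/*
 * Naive check: hsize is implicitly converted to unsigned int before the
 * comparison, so a negative hsize becomes a very large positive value.
 */
static bool exceeds_naive(int hsize, unsigned int len)
{
	return hsize > len;
}

/* Ordered check: handle negative values first, then compare safely. */
static bool exceeds_ordered(int hsize, unsigned int len)
{
	if (hsize < 0)
		return false;
	return (unsigned int)hsize > len;
}
```

For `hsize = -1` and `len = 100`, the naive check reports that `hsize` exceeds `len`, while the ordered check does not.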

v1 -> v2:
 - reorder hsize checks instead of explicit cast (Alex)
Bisected-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Fixes: dbd50f23 ("net: move the hsize check to the else block in skb_segment")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/861947c2d2d087db82af93c21920ce8147d15490.1611074818.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

00b229f7

tcp: do not mess with cloned skbs in tcp_add_backlog() · b160c285

Authored by Eric Dumazet on Jan 19, 2021

Heiner Kallweit reported that some skbs were sent with
the following invalid GSO properties :
- gso_size > 0
- gso_type == 0

This was triggering a WARN_ON_ONCE() in rtl8169_tso_csum_v2.

Juerg Haefliger was able to reproduce a similar issue using
a lan78xx NIC and a workload mixing TCP incoming traffic
and forwarded packets.

The problem is that tcp_add_backlog() is writing
over gso_segs and gso_size even if the incoming packet will not
be coalesced to the backlog tail packet.

While skb_try_coalesce() would bail out if tail packet is cloned,
this overwriting would lead to corruptions of other packets
cooked by lan78xx, sharing a common super-packet.

The strategy used by lan78xx is to use a big skb, and split
it into all received packets using skb_clone() to avoid copies.
The drawback of this strategy is that all the small skb share a common
struct skb_shared_info.

This patch rewrites TCP gso_size/gso_segs handling to only
happen on the tail skb, since skb_try_coalesce() made sure
it was not cloned.
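
Why writing GSO fields through a cloned buffer is dangerous can be shown with a small model. This is a sketch under stated assumptions: `struct shared_info_stub`, `struct buf_stub` and `set_gso()` are hypothetical, and a plain reference count stands in for the kernel's clone tracking.

```c
#include <stdbool.h>

/* Metadata shared by every clone of the same underlying data. */
struct shared_info_stub {
	unsigned int gso_size;
	unsigned int gso_segs;
	int refs; /* number of buffers pointing at this metadata */
};

struct buf_stub {
	struct shared_info_stub *shinfo;
};

static bool buf_is_cloned(const struct buf_stub *b)
{
	return b->shinfo->refs > 1;
}

/*
 * Update the GSO fields only when this buffer is the sole user of the
 * shared metadata; otherwise the write would leak into every clone.
 */
static bool set_gso(struct buf_stub *b, unsigned int size, unsigned int segs)
{
	if (buf_is_cloned(b))
		return false;
	b->shinfo->gso_size = size;
	b->shinfo->gso_segs = segs;
	return true;
}
```

The guard mirrors the idea in the message above: touch the fields only on a buffer that is known not to be cloned.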

Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Juerg Haefliger <juergh@canonical.com>
Tested-by: Juerg Haefliger <juergh@canonical.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423
Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b160c285

taprio: boolean values to a bool variable · 0deee7aa

Authored by Jiapeng Zhong on Jan 18, 2021

Fix the following coccicheck warnings:

./net/sched/sch_taprio.c:393:3-16: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:375:2-15: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:244:4-19: WARNING: Assignment of 0/1 to bool
variable.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Link: https://lore.kernel.org/r/1610958662-71166-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0deee7aa

net: nfc: nci: fix the wrong NCI_CORE_INIT parameters · 4964e5a1

Authored by Bongsu Jeon on Jan 19, 2021

Fix the code because NCI_CORE_INIT_CMD includes two parameters in NCI2.0
but there are no parameters in NCI1.x.

Fixes: bcd684aa ("net/nfc/nci: Support NCI 2.x initial sequence")
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20210118205522.317087-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4964e5a1

net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled · a3eb4e9d

Authored by Tariq Toukan on Jan 17, 2021

With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be
logically done when RXCSUM offload is off.

Fixes: 14136564 ("net: Add TLS RX offload feature")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a3eb4e9d

net: add inline function skb_csum_is_sctp · fa821170

Authored by Xin Long on Jan 16, 2021

This patch defines an inline function skb_csum_is_sctp() and uses it
in all places that check whether an skb is an SCTP CSUM skb.
This function will be used later in many networking drivers in
the following patches.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fa821170

netfilter: rpfilter: mask ecn bits before fib lookup · 2e5a6266

Authored by Guillaume Nault on Jan 16, 2021

RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.
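
The effect of the two masks on the TOS values used in the reproducer can be checked with a few lines of C. The constants below mirror the values of the Linux definitions IPTOS_TOS_MASK (0x1E, applied by RT_TOS()) and IPTOS_RT_MASK (0x1C); the local macro and function names are illustrative only.

```c
#include <stdint.h>

/* 0x1E keeps the TOS bits plus the upper ECN bit (what RT_TOS() applies). */
#define MASK_KEEPS_ONE_ECN_BIT 0x1E
/* 0x1C clears both ECN bits (the value of IPTOS_RT_MASK). */
#define MASK_CLEARS_ECN_BITS   (0x1E & ~3)

static uint8_t tos_one_ecn_bit_kept(uint8_t tos)
{
	return tos & MASK_KEEPS_ONE_ECN_BIT;
}

static uint8_t tos_ecn_cleared(uint8_t tos)
{
	return tos & MASK_CLEARS_ECN_BITS;
}
```

With the first mask, TOS bytes 4 (Not-ECT) and 5 (ECT(1)) both yield 4, but 6 (ECT(0)) and 7 (CE) yield 6 and so no longer match a route configured with `tos 4`. With the second mask all four values yield 4.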

Reproducer:

  Create two netns, connected with a veth:
  $ ip netns add ns0
  $ ip netns add ns1
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up
  $ ip -netns ns0 address add 192.0.2.10/32 dev veth01
  $ ip -netns ns1 address add 192.0.2.11/32 dev veth10

  Add a route to ns1 in ns0:
  $ ip -netns ns0 route add 192.0.2.11/32 dev veth01

  In ns1, only packets with TOS 4 can be routed to ns0:
  $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10

  Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
  is 4:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 0% packet loss ...

  Now use iptables' rpfilter module in ns1:
  $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP

  Not-ECT and ECT(1) packets still pass:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...

  But ECT(0) and CE packets are dropped:
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 100% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 100% packet loss ...

After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.

Fixes: 8f97339d ("netfilter: add ipv4 reverse path filter match")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2e5a6266

udp: mask TOS bits in udp_v4_early_demux() · 8d2b51b0

Authored by Guillaume Nault on Jan 16, 2021

udp_v4_early_demux() is the only function that calls
ip_mc_validate_source() with a TOS that hasn't been masked with
IPTOS_RT_MASK.

This results in different behaviours for incoming multicast UDPv4
packets, depending on if ip_mc_validate_source() is called from the
early-demux path (udp_v4_early_demux) or from the regular input path
(ip_route_input_noref).

ECN would normally not be used with UDP multicast packets, so the
practical consequences should be limited on that side. However,
IPTOS_RT_MASK is used to also mask the TOS' high order bits, to align
with the non-early-demux path behaviour.

Reproducer:

  Set up two netns, connected with veth:
  $ ip netns add ns0
  $ ip netns add ns1
  $ ip -netns ns0 link set dev lo up
  $ ip -netns ns1 link set dev lo up
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up
  $ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
  $ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10

  In ns0, add route to multicast address 224.0.2.0/24 using source
  address 198.51.100.10:
  $ ip -netns ns0 address add 198.51.100.10/32 dev lo
  $ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10

  In ns1, define route to 198.51.100.10, only for packets with TOS 4:
  $ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10

  Also activate rp_filter in ns1, so that incoming packets not matching
  the above route get dropped:
  $ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1

  Now try to receive packets on 224.0.2.11:
  $ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -

  In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
  tos 6 for socat):
  $ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6

  The "test0" message is properly received by socat in ns1, because
  early-demux has no cached dst to use, so source address validation
  is done by ip_route_input_mc(), which receives a TOS that has the
  ECN bits masked.

  Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
  $ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6

  The "test1" message isn't received by socat in ns1, because, now,
  early-demux has a cached dst to use and calls ip_mc_validate_source()
  immediately, without masking the ECN bits.

Fixes: bc044e8d ("udp: perform source validation for mcast early demux")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8d2b51b0

xsk: Clear pool even for inactive queues · b425e24a

Authored by Maxim Mikityanskiy on Jan 18, 2021

The number of queues can change by other means, rather than ethtool. For
example, attaching an mqprio qdisc with num_tc > 1 leads to creating
multiple sets of TX queues, which may be then destroyed when mqprio is
deleted. If an AF_XDP socket is created while mqprio is active,
dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may
decrease with deletion of mqprio, which will mean that the pool won't be
NULLed, and a further increase of the number of TX queues may expose a
dangling pointer.

To avoid any potential misbehavior, this commit clears pool for RX and
TX queues, regardless of real_num_*_queues, still taking into
consideration num_*_queues to avoid overflows.
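
The bounds rule described above, clearing by allocated queue count rather than by active queue count, can be sketched like this. The types and names (`struct netdev_stub`, `clear_pool_at_qid()`, the array size of 8) are hypothetical and only model the idea.

```c
#include <stddef.h>

#define NUM_TX_QUEUES 8 /* allocated queue slots (size chosen for the example) */

struct tx_queue_stub {
	void *pool;
};

struct netdev_stub {
	struct tx_queue_stub tx[NUM_TX_QUEUES];
	unsigned int num_tx_queues;      /* allocated queues */
	unsigned int real_num_tx_queues; /* currently active queues */
};

/*
 * Clear the pool pointer for a queue id. The check uses the allocated
 * count, so an inactive queue is still cleared, while an out-of-range
 * id is ignored instead of overflowing the array.
 */
static void clear_pool_at_qid(struct netdev_stub *dev, unsigned int qid)
{
	if (qid < dev->num_tx_queues)
		dev->tx[qid].pool = NULL;
}
```

Had the check used `real_num_tx_queues`, a pointer stored while more queues were active would survive the shrink and reappear as a dangling pointer when the queue count grows again.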

Fixes: 1c1efc2a ("xsk: Create and free buffer pool independently from umem")
Fixes: a41b4f3c ("xsk: simplify xdp_clear_umem_at_qid implementation")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com

b425e24a

net: core: devlink: use right genl user_ptr when handling port param get/set · 7e238de8

Authored by Oleksandr Mazur on Jan 19, 2021

Fix incorrect user_ptr dereferencing when handling port param get/set:

idx [0] stores the 'struct devlink' pointer;
idx [1] stores the 'struct devlink_port' pointer;
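
The slot convention listed above can be made harder to misuse by hiding the indices behind accessors. The following is a hypothetical sketch, not the devlink code: `struct req_info_stub` and the two stub types only model a request that carries two opaque pointers.

```c
#include <stddef.h>

struct devlink_stub { int id; };
struct port_stub    { int index; };

/* Request context carrying two opaque per-request pointers. */
struct req_info_stub {
	void *user_ptr[2];
};

/* Slot 0 holds the device object. */
static struct devlink_stub *req_devlink(const struct req_info_stub *info)
{
	return info->user_ptr[0];
}

/* Slot 1 holds the port object. */
static struct port_stub *req_port(const struct req_info_stub *info)
{
	return info->user_ptr[1];
}
```

Reading a port from slot 0 would compile without complaint because the slots are `void *`, which is why a single wrong index is enough to cause the kind of bug this commit fixes.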

Fixes: 637989b5 ("devlink: Always use user_ptr[0] for devlink and simplify post_doit")
CC: Parav Pandit <parav@mellanox.com>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7e238de8

19 Jan, 2021 8 commits

net/tls: Except bond interface from some TLS checks · 4e5a7332

Authored by Tariq Toukan on Jan 17, 2021

In the tls_dev_event handler, ignore the tlsdev_ops requirement for bond
interfaces; the ops do not exist there, as the interaction is done
directly with the lower device.

Also, make the validate function pass when it's called with the upper
bond interface.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4e5a7332

net/tls: Device offload to use lowest netdevice in chain · 153cbd13

Authored by Tariq Toukan on Jan 17, 2021

Do not call the tls_dev_ops of upper devices. Instead, ask them
for the proper lowest device and communicate with it directly.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

153cbd13

net: netdevice: Add operation ndo_sk_get_lower_dev · 719a402c

Authored by Tariq Toukan on Jan 17, 2021

ndo_sk_get_lower_dev returns the lower netdev that corresponds to
a given socket.
Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
the lowest one in chain.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

719a402c

net_sched: fix RTNL deadlock again caused by request_module() · d349f997

Authored by Cong Wang on Jan 16, 2021

tcf_action_init_1() loads tc action modules automatically with
request_module() after parsing the tc action names; it drops the RTNL
lock before request_module() and re-acquires it afterwards. This causes
a lot of trouble, as discovered by syzbot, because we can be in the
middle of batch initialization when we create an array of tc actions.

One of the problems is a deadlock:

CPU 0					CPU 1
rtnl_lock();
for (...) {
  tcf_action_init_1();
    -> rtnl_unlock();
    -> request_module();
				rtnl_lock();
				for (...) {
				  tcf_action_init_1();
				    -> tcf_idr_check_alloc();
				   // Insert one action into idr,
				   // but it is not committed until
				   // tcf_idr_insert_many(), then drop
				   // the RTNL lock in the _next_
				   // iteration
				   -> rtnl_unlock();
    -> rtnl_lock();
    -> a_o->init();
      -> tcf_idr_check_alloc();
      // Now waiting for the same index
      // to be committed
				    -> request_module();
				    -> rtnl_lock()
				    // Now waiting for RTNL lock
				}
				rtnl_unlock();
}
rtnl_unlock();

This is not easy to solve. We can move the request_module() call before
this loop, pre-load all the modules we need for this netlink message,
and then do the rest of the initialization. So the loop is now split
into two:

        for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                struct tc_action_ops *a_o;

                a_o = tc_action_load_ops(name, tb[i]...);
                ops[i - 1] = a_o;
        }

        for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                act = tcf_action_init_1(ops[i - 1]...);
        }

Although this looks serious, it only has been reported by syzbot, so it
seems hard to trigger this by humans. And given the size of this patch,
I'd suggest to make it to net-next and not to backport to stable.

This patch has been tested by syzbot and tested with tdc.py by me.

Fixes: 0fedc63f ("net_sched: commit action insertions together")
Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d349f997

tcp: fix TCP_USER_TIMEOUT with zero window · 9d9b1ee0

Authored by Enke Chen on Jan 15, 2021

The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

    RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
    as long as the receiver continues to respond probes. We support
    this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.

In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
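
The difference between the probe counter and a timestamp can be modelled in a few lines. This is a simplified sketch with hypothetical names (`struct probe_state`, `zero_win_since`, millisecond timestamps that are assumed to be non-zero); it is not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

struct probe_state {
	unsigned int probes_out; /* unanswered probes; reset by any ACK */
	uint64_t zero_win_since; /* time of the first probe of the current
				  * zero-window period, 0 if none */
};

static void on_probe_sent(struct probe_state *s, uint64_t now_ms)
{
	s->probes_out++;
	if (!s->zero_win_since)
		s->zero_win_since = now_ms; /* start of the closed-window period */
}

static void on_ack(struct probe_state *s, unsigned int window)
{
	s->probes_out = 0;         /* every ACK resets the counter */
	if (window > 0)
		s->zero_win_since = 0; /* only an opened window ends the period */
}

/* The user timeout is judged on elapsed time, not on the probe counter. */
static bool user_timeout_expired(const struct probe_state *s,
				 uint64_t now_ms, uint64_t user_timeout_ms)
{
	return s->zero_win_since != 0 &&
	       now_ms - s->zero_win_since >= user_timeout_ms;
}
```

A peer that keeps answering probes with zero-window ACKs holds `probes_out` at zero forever, while `zero_win_since` keeps its value, so the elapsed time can eventually reach the configured timeout.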

Fixes: 9721e709 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9d9b1ee0

ipv6: set multicast flag on the multicast route · ceed9038

Authored by Matteo Croce on Jan 15, 2021

The multicast route ff00::/8 is created with type RTN_UNICAST:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium

Set the type to RTN_MULTICAST which is more appropriate.

Fixes: e8478e80 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ceed9038

ipv6: create multicast route with RTPROT_KERNEL · a826b043

Authored by Matteo Croce on Jan 15, 2021

The ff00::/8 multicast route is created without specifying the fc_protocol
field, so the default RTPROT_BOOT value is used:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium

As the documentation says, this value identifies routes installed during
boot, but the route is created when interface is set up.
Change the value to RTPROT_KERNEL which is a better value.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a826b043

net: bridge: check vlan with eth_type_vlan() method · a98c0c47

Authored by Menglong Dong on Jan 17, 2021

Replace some checks for ETH_P_8021Q and ETH_P_8021AD with
eth_type_vlan().
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210117080950.122761-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a98c0c47
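
For the bridge commit above, the kind of check that a helper such as eth_type_vlan() centralizes can be sketched in userspace C. The function below is a hypothetical stand-in: it works on host-byte-order values for simplicity (an assumption made for this example), and the two constants are the standard EtherType values for 802.1Q and 802.1ad.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETHERTYPE_8021Q  0x8100 /* IEEE 802.1Q VLAN tag */
#define ETHERTYPE_8021AD 0x88A8 /* IEEE 802.1ad service VLAN tag */

/* True if the EtherType (host byte order here) denotes a VLAN tag. */
static bool is_vlan_ethertype(uint16_t ethertype)
{
	switch (ethertype) {
	case ETHERTYPE_8021Q:
	case ETHERTYPE_8021AD:
		return true;
	default:
		return false;
	}
}
```

Replacing repeated two-way comparisons with one helper keeps the list of VLAN EtherTypes in a single place.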

17 Jan, 2021 3 commits

sctp: remove the NETIF_F_SG flag before calling skb_segment · 1fef8544

Authored by Xin Long on Jan 15, 2021

It makes more sense to clear NETIF_F_SG instead of setting it when
calling skb_segment() in sctp_gso_segment(), since SCTP GSO is
using head_skb's fraglist, of which all frags are linear skbs.

This will make SCTP GSO code more understandable.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1fef8544

net: move the hsize check to the else block in skb_segment · dbd50f23

Authored by Xin Long on Jan 15, 2021

After commit 89319d38 ("net: Add frag_list support to skb_segment"),
it goes to process frag_list when !hsize in skb_segment(). However, when
using skb frag_list, sg normally should not be set. In this case, hsize
will be set with len right before !hsize check, then it won't go to
frag_list processing code.

So the right thing to do is move the hsize check to the else block, so
that it won't affect the !hsize check for frag_list processing.

v1->v2:
  - change to do "hsize <= 0" check instead of "!hsize", and also move
    "hsize < 0" into else block, to save some cycles, as Alex suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dbd50f23

skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too · 66c55602

Authored by Alexander Lobakin on Jan 15, 2021

Commit 3226b158 ("net: avoid 32 x truesize under-estimation for
tiny skbs") ensured that skbs with data size lower than 1025 bytes
will be kmalloc'ed to avoid excessive page cache fragmentation and
memory consumption.
However, the fix addressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be triggered
through __netdev_alloc_skb(), which is still used by several drivers.
Drivers often allocate a tiny skb for headers and place the rest of
the frame to frags (so-called copybreak).
Mirror the condition to __netdev_alloc_skb() to handle this case too.

Since v1 [0]:
 - fix "Fixes:" tag;
 - refine commit message (mention copybreak usecase).

[0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me

Fixes: a1c7fff7 ("net: netdev_alloc_skb() use build_skb()")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

66c55602
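
The size rule described in the commit above can be illustrated with a small stand-alone sketch. The names `TINY_SKB_MAX_LEN`, `skb_backing` and `choose_backing()` are hypothetical and do not exist in the kernel; the 1024-byte bound is taken from the "lower than 1025 bytes" wording in the message, and the kernel's real check may cover additional cases that are not modeled here.

```c
#include <stddef.h>

/* Assumption: bound taken from "lower than 1025 bytes" in the message. */
#define TINY_SKB_MAX_LEN 1024

/* Hypothetical labels for the two ways the data buffer can be backed. */
enum skb_backing {
	BACKING_KMALLOC,   /* small buffer from the slab allocator */
	BACKING_PAGE_FRAG  /* buffer carved out of a page fragment */
};

/*
 * Pick a backing store from the requested data length alone.
 * Applying the same rule in both allocation paths is what the
 * commit calls "mirroring the condition".
 */
static enum skb_backing choose_backing(size_t len)
{
	return len <= TINY_SKB_MAX_LEN ? BACKING_KMALLOC : BACKING_PAGE_FRAG;
}
```

Under this rule a copybreak header skb of, say, 128 bytes is kmalloc'ed, while a full 1500-byte frame still uses a page fragment.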

Jan 16, 2021 · 4 commits

tcp_cubic: use memset and offsetof init · f4d133d8

Committed by Yejune Deng on Jan 14, 2021

In bictcp_reset(), use memset and offsetof instead of = 0.

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Link: https://lore.kernel.org/r/1610597696-128610-1-git-send-email-yejune.deng@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f4d133d8
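
The memset-plus-offsetof idiom named in the commit above can be sketched outside the kernel as follows. The struct `cc_state`, its members and `cc_state_reset()` are hypothetical and are not the layout of the kernel's `struct bictcp`; the sketch only shows how one call can zero every member laid out before a chosen boundary member.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state struct; not the kernel's struct bictcp. */
struct cc_state {
	uint32_t cnt;
	uint32_t last_max_cwnd;
	uint32_t last_cwnd;
	uint32_t epoch_start;
	/* Boundary: this member and everything after it survive a reset. */
	uint32_t keep_from_here;
	uint32_t sample_cnt;
};

/*
 * Zero all members that precede keep_from_here with a single memset
 * instead of one "= 0" assignment per member.
 */
static void cc_state_reset(struct cc_state *st)
{
	memset(st, 0, offsetof(struct cc_state, keep_from_here));
}
```

The design relies on member order: any field added before the boundary is reset automatically, so fields that must persist have to stay at or after it.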

nfc: netlink: use &w->w in nfc_genl_rcv_nl_event · 32d91b4a

Committed by Geliang Tang on Jan 15, 2021

Use the struct member w of the struct urelease_work directly instead of
casting it.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Link: https://lore.kernel.org/r/f0ed86d6d54ac0834bd2e161d172bf7bb5647cf7.1610683862.git.geliangtang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

32d91b4a
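
A minimal sketch of the difference between casting the outer struct and taking the address of its embedded member, as the commit above describes. The types `work_item` and `release_work` and both helper functions are hypothetical stand-ins, not the kernel's `struct work_struct` or `struct urelease_work`.

```c
#include <stddef.h>

/* Hypothetical stand-in for an embedded work descriptor. */
struct work_item {
	int pending;
};

/* Hypothetical outer struct that embeds the work descriptor. */
struct release_work {
	struct work_item w;  /* happens to be the first member */
	unsigned int portid;
};

/* Fragile: only correct for as long as w stays the first member. */
static struct work_item *work_by_cast(struct release_work *rw)
{
	return (struct work_item *)rw;
}

/* Robust: type-checked by the compiler and independent of member order. */
static struct work_item *work_by_member(struct release_work *rw)
{
	return &rw->w;
}
```

Both helpers return the same pointer with this layout, but only the member form keeps working if another field is later placed ahead of `w`.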

net: dsa: add ops for devlink-sb · 2a6ef763

Committed by Vladimir Oltean on Jan 15, 2021

Switches that care about QoS might have hardware support for reserving
buffer pools for individual ports or traffic classes, and configuring
their sizes and thresholds. Through devlink-sb (shared buffers), this is
all configurable, as well as their occupancy being viewable.

Add the plumbing in DSA for these operations.

Individual drivers still need to call devlink_sb_register() with the
shared buffers they want to expose. A helper was not created in DSA for
this purpose (unlike, say, dsa_devlink_params_register), since in my
opinion it does not bring any benefit over plainly calling
devlink_sb_register() directly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2a6ef763

net_sched: avoid shift-out-of-bounds in tcindex_set_parms() · bcd0cf19

Committed by Eric Dumazet on Jan 14, 2021

tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT
attribute is not silly.

UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29
shift exponent 255 is too large for 32-bit type 'int'
CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline]
 tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425
 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546
 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127
 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bcd0cf19
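
The kind of bound the commit above calls for can be sketched as a stand-alone check. `TC_INDEX_BITS`, `check_shift()` and `max_hash_index()` are hypothetical names, and the limit of 16 is an assumption derived from the 16-bit width stated in the message; the actual patch may enforce the limit differently.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: tc_index is 16 bits wide, as stated in the commit message. */
#define TC_INDEX_BITS 16

/* Reject shift counts that make no sense for a 16-bit value. */
static int check_shift(uint32_t shift)
{
	return shift > TC_INDEX_BITS ? -EINVAL : 0;
}

/*
 * Compute the largest index reachable with a given mask and shift.
 * The shift is validated first, so the right shift below is never
 * performed with an out-of-range count such as the 255 that UBSAN
 * reported.
 */
static int max_hash_index(uint16_t mask, uint32_t shift, uint32_t *out)
{
	int err = check_shift(shift);

	if (err)
		return err;
	*out = (uint32_t)mask >> shift;
	return 0;
}
```

Validating the user-supplied attribute once, before it is stored or used, keeps every later shift within defined behavior.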

openeuler / Kernel · last synced successfully more than 1 year ago