Commits · 3c5548812a0cf536b98f8d9f7f9377bd304809c1 · openeuler / Kernel

Oct 26, 2021 · 40 commits

net: ax88796c: Fix clang -Wimplicit-fallthrough in ax88796c_set_mac() · 3c554881

Committed by Nathan Chancellor on Oct 25, 2021

Clang warns:

drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
        case SPEED_10:
        ^
drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through
        case SPEED_10:
        ^
        break;
drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
        case DUPLEX_HALF:
        ^
drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through
        case DUPLEX_HALF:
        ^
        break;

Clang is a little more pedantic than GCC, which permits implicit
fallthroughs to cases that contain just break or return. Clang's version
is more in line with the kernel's own stance in deprecated.rst, which
states that all switch/case blocks must end in either break,
fallthrough, continue, goto, or return. Add the missing breaks to fix
the warning.
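The pattern being fixed can be sketched as follows. This is a minimal illustration, not the driver's code: `link_mode_bits`, the bit masks, and the constant values are hypothetical names chosen for the example. Every case, including a last case that does nothing, ends in an explicit `break`.

```c
#include <assert.h>

/* Hypothetical constants for illustration; not the driver's definitions. */
enum { SPEED_10 = 10, SPEED_100 = 100 };
enum { DUPLEX_HALF = 0, DUPLEX_FULL = 1 };
#define MODE_SPEED_100   0x1u
#define MODE_FULL_DUPLEX 0x2u

/* Build a mode word; every case ends in an explicit break. */
static unsigned int link_mode_bits(int speed, int duplex)
{
	unsigned int mode = 0;

	assert(duplex == DUPLEX_HALF || duplex == DUPLEX_FULL);

	switch (speed) {
	case SPEED_100:
		mode |= MODE_SPEED_100;
		break;	/* without this, Clang flags the fall-through */
	case SPEED_10:
		break;
	}

	switch (duplex) {
	case DUPLEX_FULL:
		mode |= MODE_FULL_DUPLEX;
		break;
	case DUPLEX_HALF:
		break;
	}

	return mode;
}
```

For example, `link_mode_bits(SPEED_100, DUPLEX_HALF)` yields only the speed bit in this sketch.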

Link: https://github.com/ClangBuiltLinux/linux/issues/1491
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mana: Allow setting the number of queues while the NIC is down · a137c069

Committed by Haiyang Zhang on Oct 25, 2021

The existing code doesn't allow setting the number of queues while the
NIC is down.

Update the ethtool handler functions to support setting the number of
queues while the NIC is at down state.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hsr: Add support for redbox supervision frames · eafaa88b

Committed by Andreas Oetken on Oct 25, 2021

Add support for the RedBox supervision frames
as defined in IEC 62439-3:2018.
Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp_stream_alloc_skb' · 3247e3ff

Committed by David S. Miller on Oct 26, 2021

Eric Dumazet says:

====================
tcp: tcp_stream_alloc_skb() changes

sk_stream_alloc_skb() is only used by TCP.

Rename it to tcp_stream_alloc_skb() and apply small
optimizations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove unneeded code from tcp_stream_alloc_skb() · c4322884

Committed by Eric Dumazet on Oct 25, 2021

Aligning @size argument to 4 bytes is not needed.

The header alignment has nothing to do with @size.

It really depends on skb->head alignment and MAX_TCP_HEADER.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb · 8a794df6

Committed by Eric Dumazet on Oct 25, 2021

Both IPv4 and IPv6 use the same reserve, so there is no need to risk
cache line misses to fetch its value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: rename sk_stream_alloc_skb · f8dd3b8d

Committed by Eric Dumazet on Oct 25, 2021

sk_stream_alloc_skb() is only used by TCP.

Rename it to make this clear, and move its declaration
to include/net/tcp.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: annotate data-race in neigh_output() · d18785e2

Committed by Eric Dumazet on Oct 25, 2021

neigh_output() reads n->nud_state and hh->hh_len locklessly.

This is fine, but we need to add annotations and document this.

We evaluate skip_cache first to avoid reading these fields
if the cache has to be bypassed.
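A rough userspace sketch of the annotated lockless reads follows. The `READ_ONCE()` macro here is a simplified stand-in for the kernel's, and `neigh_sketch`, `SKETCH_CONNECTED`, and `use_cached_header` are hypothetical names for illustration. `skip_cache` is tested first, so neither field is read when the cache is bypassed.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's READ_ONCE(); GNU C only. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define SKETCH_CONNECTED 0xC2u	/* hypothetical "connected" state mask */

struct neigh_sketch {
	unsigned char nud_state;	/* may be written by another CPU */
	unsigned int hh_len;		/* cached header length, 0 if none */
};

/* Decide whether the cached hardware header may be used. */
static bool use_cached_header(const struct neigh_sketch *n, bool skip_cache)
{
	assert(n);
	/* skip_cache short-circuits, so the fields are not read when set. */
	return !skip_cache &&
	       (READ_ONCE(n->nud_state) & SKETCH_CONNECTED) &&
	       READ_ONCE(n->hh_len);
}
```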

syzbot report:

BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2

write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1:
 __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128
 neigh_event_send include/net/neighbour.h:444 [inline]
 neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476
 neigh_output include/net/neighbour.h:510 [inline]
 ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221
 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
 NF_HOOK_COND include/linux/netfilter.h:296 [inline]
 ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
 dst_output include/net/dst.h:450 [inline]
 ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
 tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
 tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
 tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
 expire_timers+0x135/0x240 kernel/time/timer.c:1466
 __run_timers+0x368/0x430 kernel/time/timer.c:1734
 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
 __do_softirq+0x12c/0x26e kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
 asm_sysvec_apic_timer_interrupt+0x12/0x20
 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
 arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
 acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
 acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
 acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
 call_cpuidle kernel/sched/idle.c:158 [inline]
 cpuidle_idle_call kernel/sched/idle.c:239 [inline]
 do_idle+0x1a3/0x250 kernel/sched/idle.c:306
 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
 secondary_startup_64_no_verify+0xb1/0xbb

read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0:
 neigh_output include/net/neighbour.h:507 [inline]
 ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221
 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
 NF_HOOK_COND include/linux/netfilter.h:296 [inline]
 ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
 dst_output include/net/dst.h:450 [inline]
 ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
 tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
 tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
 tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
 expire_timers+0x135/0x240 kernel/time/timer.c:1466
 __run_timers+0x368/0x430 kernel/time/timer.c:1734
 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
 __do_softirq+0x12c/0x26e kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
 asm_sysvec_apic_timer_interrupt+0x12/0x20
 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
 arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
 acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
 acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
 acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
 call_cpuidle kernel/sched/idle.c:158 [inline]
 cpuidle_idle_call kernel/sched/idle.c:239 [inline]
 do_idle+0x1a3/0x250 kernel/sched/idle.c:306
 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
 rest_init+0xee/0x100 init/main.c:734
 arch_call_rest_init+0xa/0xb
 start_kernel+0x5e4/0x669 init/main.c:1142
 secondary_startup_64_no_verify+0xb1/0xbb

value changed: 0x20 -> 0x01

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-rif-mac-prefixes' · 72b93a86

Committed by David S. Miller on Oct 26, 2021

Ido Schimmel says:

====================
mlxsw: Support multiple RIF MAC prefixes

Currently, mlxsw enforces that all the netdevs used as router interfaces
(RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1).
Otherwise, an error is returned to user space with extack. This patchset
relaxes the limitation through the use of RIF MAC profiles.

A RIF MAC profile is a hardware entity that represents a particular MAC
prefix which multiple RIFs can reference. Therefore, the number of
possible MAC prefixes is no longer one, but the number of profiles
supported by the device.

The ability to change the MAC of a particular netdev is useful, for
example, for users who use the netdev to connect to an upstream provider
that performs MAC filtering. Currently, such users are either forced to
negotiate with the provider or change the MAC address of all other
netdevs so that they share the same prefix.

Patchset overview:

Patches #1-#3 are preparations.

Patch #4 adds actual support for RIF MAC profiles.

Patch #5 exposes RIF MAC profiles as a devlink resource, so that user
space has visibility into the maximum number of profiles and current
occupancy. Useful for debugging and testing (next 3 patches).

Patches #6-#8 add both scale and functional tests.

Patch #9 removes tests that validated the previous limitation. It is now
covered by patch #6 for devices that support a single profile.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Remove deprecated test cases · c24dbf3d

Committed by Danielle Ratson on Oct 26, 2021

After adding the previous patches, the constraint that all the router
interface MAC addresses have the same prefix is no longer relevant.

Remove the test cases that validated that this constraint is honored.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: Add an occupancy test for RIF MAC profiles · 20d446db

Committed by Danielle Ratson on Oct 26, 2021

When all the RIF MAC profiles are in use, test that it is possible to
change the MAC of a netdev (i.e., a RIF) when its MAC profile is not
shared with other RIFs. Test that replacement fails when the MAC profile
is shared.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add forwarding test for RIF MAC profiles · a10b7bac

Committed by Danielle Ratson on Oct 26, 2021

Verify that MAC profile changes are indeed applied and that packets are
forwarded with the correct source MAC.

Output example:

$ ./rif_mac_profiles.sh
TEST: h1->h2: new mac profile                                       [ OK ]
TEST: h2->h1: new mac profile                                       [ OK ]
TEST: h1->h2: edit mac profile                                      [ OK ]
TEST: h2->h1: edit mac profile                                      [ OK ]
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add a scale test for RIF MAC profiles · 152f98e7

Committed by Danielle Ratson on Oct 26, 2021

Query the maximum number of supported RIF MAC profiles using
devlink-resource and verify that all available MAC profiles can be utilized
and that an error is generated when user space tries to exceed this number.

Output example in Spectrum-2:

$ TESTS='rif_mac_profile' ./resource_scale.sh
TEST: 'rif_mac_profile' 4                                           [ OK ]
TEST: 'rif_mac_profile' overflow 5                                  [ OK ]
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource · 1c375ffb

Committed by Danielle Ratson on Oct 26, 2021

Expose via devlink-resource the maximum number of RIF MAC profiles and
their current occupancy, so it can be used for debug and writing generic
tests, like in the next patch.

Example for Spectrum-2 output:

$ devlink resource show pci/0000:06:00.0
...
  name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Add RIF MAC profiles support · 605d25cd

Committed by Danielle Ratson on Oct 26, 2021

Currently, mlxsw enforces that all the router interfaces (RIFs) have the
same MAC prefix.

Relax this limitation by using RIF MAC profiles. Each profile is
associated with a particular MAC prefix and multiple RIFs can use the
same profile. Therefore, the number of possible MAC prefixes is no
longer one, but the number of profiles supported by the device.

Store the profiles in an IDR and reference count them according to the
number of RIFs using them.

Associate a RIF with a profile when the RIF is created and remove the
association when the RIF is deleted.

Change the association following 'NETDEV_CHANGEADDR' events, except when
only one RIF is using the profile. In which case, change the MAC prefix
of the profile itself instead of associating the RIF with a new profile.
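The reference-counting scheme described above can be sketched in plain C. This is an illustration under stated assumptions, not the driver's implementation: a fixed-size array stands in for the IDR, and `MAX_PROFILES`, `profile_get`, `profile_put`, and `profile_replace` are hypothetical names. In this sketch, an address change edits the profile in place when the RIF is its only user and no existing profile already has the new prefix; otherwise the RIF moves to another profile.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PROFILES 4	/* hypothetical device limit */

struct mac_profile {
	uint64_t prefix;	/* MAC prefix shared by the users */
	unsigned int refs;	/* number of RIFs using it; 0 means free */
};

static struct mac_profile profiles[MAX_PROFILES];

/* Return the index of the in-use profile with @prefix, or -1. */
static int profile_find(uint64_t prefix)
{
	for (int i = 0; i < MAX_PROFILES; i++)
		if (profiles[i].refs && profiles[i].prefix == prefix)
			return i;
	return -1;
}

/* Take a reference on the profile for @prefix, creating it if needed. */
static int profile_get(uint64_t prefix)
{
	int id = profile_find(prefix);

	if (id >= 0) {
		profiles[id].refs++;
		return id;
	}
	for (int i = 0; i < MAX_PROFILES; i++) {
		if (!profiles[i].refs) {
			profiles[i].prefix = prefix;
			profiles[i].refs = 1;
			return i;
		}
	}
	return -1;	/* all profiles are in use */
}

/* Drop one reference; the slot becomes free when the count hits zero. */
static void profile_put(int id)
{
	assert(id >= 0 && id < MAX_PROFILES && profiles[id].refs > 0);
	profiles[id].refs--;
}

/* Address change: edit in place when unshared, else switch profiles. */
static int profile_replace(int old_id, uint64_t new_prefix)
{
	int new_id;

	assert(old_id >= 0 && old_id < MAX_PROFILES && profiles[old_id].refs > 0);
	if (profiles[old_id].refs == 1 && profile_find(new_prefix) < 0) {
		profiles[old_id].prefix = new_prefix;
		return old_id;
	}
	new_id = profile_get(new_prefix);
	if (new_id < 0)
		return -1;	/* no free profile; keep the old association */
	profile_put(old_id);
	return new_id;
}
```

When every slot is taken, `profile_replace()` still succeeds for an unshared profile but fails for a shared one, which is the behaviour the occupancy selftest in this series checks.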
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Propagate extack further · 26029225

Committed by Danielle Ratson on Oct 26, 2021

The next patch will set the MAC profile of a router interface (RIF) as
part of its configure() callback. The operation can fail in case the
maximum number of profiles was exceeded.

Add extack to mlxsw_sp_rif_ops::configure() in order to communicate such
failures to user space.

In addition, the MAC profile of a RIF can change following a
'NETDEV_CHANGEADDR' notification. Propagate extack to
mlxsw_sp_router_port_change_event() so that failures could be
communicated in this path as well.

No functional changes intended.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: resources: Add resource identifier for RIF MAC profiles · a8428e50

Committed by Danielle Ratson on Oct 26, 2021

Add a resource identifier for maximum RIF MAC profiles so that it could
be later used to query the information from firmware.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add MAC profile ID field to RITR register · d25d7fc3

Committed by Danielle Ratson on Oct 26, 2021

Add MAC profile ID field to RITR register so that it could be used for
associating a RIF with a MAC profile ID by a later patch.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'netfilter-vrf-rework' · be348926

Committed by David S. Miller on Oct 26, 2021

Florian Westphal says:

====================
vrf: rework interaction with netfilter/conntrack

V2:
- fix 'plain integer as null pointer' warning
- reword commit message in patch 2 to clarify loss of 'ct set untracked'

This patch series aims to solve the to-be-reverted change 09e856d5
("vrf: Reset skb conntrack connection on VRF rcv") in a different way.

Rather than have skbs pass through conntrack and nat hooks twice, suppress
conntrack invocation if the conntrack/nat hook is called from the vrf driver.

First patch deals with 'incoming connection' case:
1. suppress NAT transformations
2. skip conntrack confirmation

NAT and conntrack confirmation is done when ip/ipv6 stack calls
the postrouting hook.

Second patch deals with local packets:
in vrf driver, mark the skbs as 'untracked', so conntrack output
hook ignores them.  This skips all nat hooks as well.

Afterwards, remove the untracked state again so the second
round will pick them up.

One alternative to the chosen implementation would be to add a 'caller
id' field to 'struct nf_hook_state' and then use that, these patches
use the more straightforward check of VRF flag on the state->out device.

The two patches apply to both net and net-next, I am targeting -next
because I think that since snat did not work correctly for so long that
we can take the longer route.  If you disagree, apply to net at your
discretion.

The patches apply both with 09e856d5 reverted or still
in-place, but only with the revert in place ingress conntrack settings
(zone, notrack etc) start working again.

I've already submitted selftests for vrf+nfqueue and conntrack+vrf.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

vrf: run conntrack only in context of lower/physdev for locally generated packets · 8c9c296a

Committed by Florian Westphal on Oct 25, 2021

The VRF driver invokes netfilter for output+postrouting hooks so that users
can create rules that check for 'oif $vrf' rather than lower device name.

This is a problem when NAT rules are configured.

To avoid any conntrack involvement in round 1, tag skbs as 'untracked'
to prevent conntrack from picking them up.

This gets cleared before the packet gets handed to the ip stack so
conntrack will be active on the second iteration.

One remaining issue is that a rule like

  output ... oif $vrfname notrack

won't propagate to the second round because we can't tell
'notrack set via ruleset' and 'notrack set by vrf driver' apart.
However, this isn't a regression: the 'notrack' removal happens
instead of unconditional nf_reset_ct().
I'd also like to avoid leaking more vrf specific conditionals into the
netfilter infra.

For ingress, conntrack has already been done before the packet makes it
to the vrf driver, with this patch egress does connection tracking with
lower/physical device as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrf · 8e0538d8

Committed by Florian Westphal on Oct 25, 2021

The VRF driver invokes netfilter for output+postrouting hooks so that users
can create rules that check for 'oif $vrf' rather than lower device name.

Afterwards, ip stack calls those hooks again.

This is a problem when conntrack is used with IP masquerading.
masquerading has an internal check that re-validates the output
interface to account for route changes.

This check will trigger in the vrf case.

If the -j MASQUERADE rule matched on the first iteration, then round 2
finds state->out->ifindex != nat->masq_index: the latter is the vrf
index, but out->ifindex is the lower device.

The packet gets dropped and the conntrack entry is invalidated.

This change makes conntrack postrouting skip the nat hooks.
Also skip confirmation.  This allows the second round
(postrouting invocation from ipv4/ipv6) to create nat bindings.

This also prevents the second round from seeing packets that had their
source address changed by the nat hook.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2021-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 4900a769

Committed by David S. Miller on Oct 26, 2021

Saeed Mahameed says:

====================
mlx5-updates-2021-10-25

Misc updates for mlx5 driver:

1) Misc updates and cleanups:
 - Don't write directly to netdev->dev_addr, From Jakub Kicinski
 - Remove unnecessary checks for slow path flag in tc module
 - Fix unused function warning of mlx5i_flow_type_mask
 - Bridge, support replacing existing FDB entry

2) Sub Functions, Reduction in memory usage:
 - Reduce flow counters bulk query buffer size
 - Implement max_macs devlink parameter
 - Add devlink vendor params to control Event Queue sizes
 - Add SF life cycle trace points, from Parav

3) From Aya, Firmware health buffer reporting improvements
 - Print health buffer by log level and more missing information
 - Periodic update of host time to firmware
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: don't free a FIN sk_buff in tcp_remove_empty_skb() · cf12e6f9

Committed by Jon Maxwell on Oct 25, 2021

v1: Implement a more general statement as recommended by Eric Dumazet. The
sequence number will be advanced, so this check will fix the FIN case and
other cases.

A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
the write_queue was not empty as determined by tcp_write_queue_empty() but the
sk_buff containing the FIN flag had been freed and the socket was zombied in
that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.

Some instrumentation was added to the kernel and it was found that there is a
timing window where tcp_sendmsg() can run after tcp_send_fin().

tcp_sendmsg() will hit an error, for example:

	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
		goto do_error;

tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.

If the other side sends a FIN packet the socket will transition to CLOSING and
remain that way until the system is rebooted.

Fix this by checking for the FIN flag in the sk_buff and not freeing it if the
flag is set. Testing confirmed that this fixes the issue.
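The difference between the old length test and a sequence-space test can be sketched as below. This assumes, as the v1 note suggests, that the more general check compares the start and end sequence numbers of the segment. `seg_sketch` and both helper names are hypothetical and only model the relevant fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a queued segment. */
struct seg_sketch {
	uint32_t seq;		/* first sequence number covered */
	uint32_t end_seq;	/* one past the last; a FIN consumes one */
	uint32_t len;		/* payload bytes */
};

/* Old-style test: any payload-free segment looks removable. */
static bool removable_by_len(const struct seg_sketch *s)
{
	assert(s);
	return s->len == 0;
}

/* Sketched fix: removable only if it consumes no sequence space. */
static bool removable_by_seq(const struct seg_sketch *s)
{
	assert(s);
	return s->seq == s->end_seq;
}
```

A bare FIN has `len == 0` but `end_seq == seq + 1`, so only the second test keeps it on the queue.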

Fixes: fdfc5c85 ("tcp: remove empty skb from write queue in error cases")
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz>
Reported-by: Simon Stier <simon.stier@mail.schwarz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'small-fixes-for-true-expression-checks' · 36d935a0

Committed by Jakub Kicinski on Oct 25, 2021

Jean Sacren says:

====================
Small fixes for true expression checks

This series fixes checks of true !rc expression.
====================

Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: qed_dev: fix check of true !rc expression · 036f590f

Committed by Jean Sacren on Oct 23, 2021

Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it
is always true.
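Why the `!rc` test is redundant can be seen in a small sketch of the branch structure. This is an assumption about the shape of the code, not a copy of it: `lock_result` and the handling of `-EINVAL` are hypothetical. Once the earlier branches have handled every non-zero `rc`, the last branch can only be reached with `rc == 0`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: rc is a lock request result, granted its outcome. */
static int lock_result(int rc, bool granted)
{
	assert(rc <= 0);	/* 0 or a negative errno */

	if (rc && rc != -EINVAL)
		return rc;	/* hard failure */
	else if (rc == -EINVAL)
		return 0;	/* hypothetical: locking unsupported, carry on */
	else if (!granted)	/* rc is already 0 here, so no !rc test */
		return -EBUSY;

	return 0;
}
```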
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: qed_ptp: fix check of true !rc expression · 165f8e82

Committed by Jean Sacren on Oct 23, 2021

Remove the check of !rc in (!rc && !params.b_granted) since it is always
true.

We should also use constant 0 for return.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-receive-path-optimizations' · e43b76ab

Committed by Jakub Kicinski on Oct 25, 2021

Eric Dumazet says:

====================
tcp: receive path optimizations

This series aims to reduce cache line misses in RX path.

I am still working on better cache locality in tcp_sock but
this will wait a few more weeks.
====================

Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6/tcp: small drop monitor changes · 12c8691d

Committed by Eric Dumazet on Oct 25, 2021

Two kfree_skb() calls must be replaced by consume_skb()
for skbs that are not technically dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: guard IP_MINTTL with a static key · 020e71a3

Committed by Eric Dumazet on Oct 25, 2021

RFC 5082 IP_MINTTL option is rarely used on hosts.

Add a static key to remove useless code from the TCP fast path,
along with a potential cache line miss to fetch inet_sk(sk)->min_ttl.

Note that once ip4_min_ttl static key has been enabled,
it stays enabled until next boot.
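The control flow of such a guard can be modelled in userspace with an ordinary flag. This is only a sketch of the logic: real static keys patch the branch in the instruction stream, which a plain `bool` does not do, and `min_ttl_key`, `sock_sketch`, `set_min_ttl`, and `ttl_too_low` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the static key; starts disabled and is never cleared. */
static bool min_ttl_key;

struct sock_sketch {
	unsigned char min_ttl;
};

/* Option-setting path: the first non-zero value turns the guard on. */
static void set_min_ttl(struct sock_sketch *sk, unsigned char ttl)
{
	assert(sk);
	if (ttl)
		min_ttl_key = true;
	sk->min_ttl = ttl;
}

/* Fast path: sk->min_ttl is only read once the guard is on. */
static bool ttl_too_low(const struct sock_sketch *sk, unsigned char pkt_ttl)
{
	if (!min_ttl_key)
		return false;
	return pkt_ttl < sk->min_ttl;
}
```

Setting the option back to zero leaves the guard on in this sketch, which mirrors the note that the key stays enabled until the next boot.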
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: annotate data races around inet->min_ttl · 14834c4f

Committed by Eric Dumazet on Oct 25, 2021

No report yet from KCSAN, yet worth documenting the races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: guard IPV6_MINHOPCOUNT with a static key · 790eb673

Committed by Eric Dumazet on Oct 25, 2021

RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.

Add a static key to remove useless code from the TCP fast path,
along with a potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount.

Note that once ip6_min_hopcount static key has been enabled,
it stays enabled until next boot.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data races around np->min_hopcount · cc17c3c8

Committed by Eric Dumazet on Oct 25, 2021

No report yet from KCSAN, yet worth documenting the races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: annotate accesses to sk->sk_rx_queue_mapping · 09b89846

Committed by Eric Dumazet on Oct 25, 2021

sk->sk_rx_queue_mapping can be modified locklessly,
add a couple of READ_ONCE()/WRITE_ONCE() to document this fact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: avoid dirtying sk->sk_rx_queue_mapping · 342159ee

Committed by Eric Dumazet on Oct 25, 2021

sk_rx_queue_mapping is located in a cache line that should be kept read mostly.
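One way to keep such a cache line read-mostly is to store only when the value actually changes. The sketch below assumes that approach. The macros are simplified userspace stand-ins for the kernel's `READ_ONCE()`/`WRITE_ONCE()`, and `rx_queue_update` with its boolean return value is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel macros; GNU C only. */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Returns true only when a store was actually issued (for illustration). */
static bool rx_queue_update(unsigned short *mapping, unsigned short queue)
{
	assert(mapping);
	if (READ_ONCE(*mapping) == queue)
		return false;	/* unchanged: leave the cache line clean */
	WRITE_ONCE(*mapping, queue);
	return true;
}
```

In the common case where consecutive packets arrive on the same queue, the function performs a load and no store.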
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: avoid dirtying sk->sk_napi_id · 2b13af8a

Committed by Eric Dumazet on Oct 25, 2021

sk_napi_id is located in a cache line that can be kept read mostly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie · ef57c161

Committed by Eric Dumazet on Oct 25, 2021

Increase cache locality by moving rx_dst_cookie next to sk->sk_rx_dst

This removes one or two cache line misses in IPv6 early demux (TCP/UDP)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex · 0c0a5ef8

Committed by Eric Dumazet on Oct 25, 2021

Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst

This is part of an effort to reduce cache line misses in TCP fast path.

This removes one cache line miss in early demux.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ax88796c: fix fetching error stats from percpu containers · fd559a94

Committed by Alexander Lobakin on Oct 23, 2021

rx_dropped, tx_dropped, rx_frame_errors and rx_crc_errors are being
wrongly fetched from the target container rather than source percpu
ones.
No idea if that goes from the vendor driver or was brainoed during
the refactoring, but fix it either way.
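The intended direction of the copy can be sketched as below. The struct layouts and `fold_stats` are hypothetical and model only two of the counters. The point is that each addition reads from the per-CPU source and accumulates into the target.

```c
#include <assert.h>

/* Hypothetical per-CPU counters and the aggregate reported to callers. */
struct pcpu_stats_sketch {
	unsigned long rx_dropped;
	unsigned long tx_dropped;
};

struct total_stats_sketch {
	unsigned long rx_dropped;
	unsigned long tx_dropped;
};

/* Add every CPU's counters into @dst; the caller zeroes @dst first. */
static void fold_stats(struct total_stats_sketch *dst,
		       const struct pcpu_stats_sketch *src, int ncpus)
{
	assert(dst && src);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		/* Read from the per-CPU source, never from @dst itself. */
		dst->rx_dropped += src[cpu].rx_dropped;
		dst->tx_dropped += src[cpu].tx_dropped;
	}
}
```

Reading the right-hand side from `dst` instead would add the running total to itself and ignore the per-CPU values, which is the class of bug the commit describes.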

Fixes: a97c69ba ("net: ax88796c: ASIX AX88796C SPI Ethernet Adapter Driver")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Łukasz Stelmach <l.stelmach@samsung.com>
Link: https://lore.kernel.org/r/20211023121148.113466-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SF_DEV Add SF device trace points · d67ab0a8

Committed by Parav Pandit on Oct 05, 2021

Add SF device add and delete specific trace points.

echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: SF, Add SF trace points · b3ccada6

Committed by Parav Pandit on Sep 21, 2021

Add support for trace events for SFs to improve debugging.
This covers
(a) port add and free trace points
(b) device level trace points
(c) SF hardware context add, free trace points.
(d) SF function activate/deactivate and state trace points

SF events examples:
echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
