- 16 Feb 2022, 4 commits
-
-
By Vladimir Oltean
Currently, when a VLAN entry is added multiple times in a row to a bridge port, nbp_vlan_add() calls br_switchdev_port_vlan_add() each time, even if the VLAN already exists and nothing about it has changed:

  bridge vlan add dev lan12 vid 100 master static

Similarly, when a VLAN is added multiple times in a row to a bridge, br_vlan_add_existing() doesn't filter the calls to br_switchdev_port_vlan_add() at all:

  bridge vlan add dev br0 vid 100 self

This behavior makes driver-level accounting of VLANs impossible: a single deletion event is enough to remove a VLAN, yet the addition event can be emitted an unlimited number of times.

The cause can be identified as follows: we rely on __vlan_add_flags() to retroactively tell us whether it has changed anything about the VLAN flags or VLAN group pvid. So we would first have to call __vlan_add_flags() before calling br_switchdev_port_vlan_add(), in order to have access to the "bool *changed" information. But we don't want to change the event ordering, because then we would have to revert the struct net_bridge_vlan changes if switchdev returns an error.

To solve this, we need another function that tells us whether any change is going to occur in the VLAN or VLAN group _prior_ to calling __vlan_add_flags(). Split __vlan_add_flags() into a precommit and a commit stage, and rename it to __vlan_flags_update(). The precommit stage, __vlan_flags_would_change(), determines whether there is any reason to notify switchdev due to a change of flags. (Note: the BRENTRY flag transition from false to true is treated separately, as a new switchdev entry, because we skipped notifying the master VLAN while it wasn't a brentry yet; it is therefore not treated as a change of flags.)

With this lookahead/precommit function in place, we can avoid notifying switchdev if nothing changed for the VLAN and VLAN group.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
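A minimal sketch of the precommit/commit pattern described above; the types and the switchdev stub are simplified placeholders, not the actual net/bridge code:

    #include <stdbool.h>
    #include <stdint.h>

    struct vlan_entry {
            uint16_t flags;
            bool pvid;
    };

    /* stand-in for br_switchdev_port_vlan_add() */
    static int switchdev_notify_stub(struct vlan_entry *v)
    {
            (void)v;
            return 0;
    }

    /* Precommit: a pure predicate with no side effects, so an error
     * from switchdev needs no rollback of the VLAN state. */
    static bool vlan_flags_would_change(const struct vlan_entry *v,
                                        uint16_t new_flags, bool new_pvid)
    {
            return v->flags != new_flags || v->pvid != new_pvid;
    }

    /* Commit: mutate state only after switchdev accepted the change. */
    static void vlan_flags_update(struct vlan_entry *v, uint16_t new_flags,
                                  bool new_pvid)
    {
            v->flags = new_flags;
            v->pvid = new_pvid;
    }

    static int vlan_add_existing(struct vlan_entry *v, uint16_t flags,
                                 bool pvid)
    {
            if (vlan_flags_would_change(v, flags, pvid)) {
                    int err = switchdev_notify_stub(v);

                    if (err)
                            return err;     /* nothing to revert */
            }
            vlan_flags_update(v, flags, pvid);
            return 0;
    }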
-
By Vladimir Oltean
Currently there is a very subtle aspect to the behavior of __vlan_add_flags(): it changes the struct net_bridge_vlan flags and pvid, yet it returns true ("changed") even if none of those changed, merely because a transition of br_vlan_is_brentry(v) from false to true took place.

This can be seen in br_vlan_add_existing(); however, we do not actually rely on this subtle behavior, since the "if" condition that checks that the vlan wasn't a brentry before had a (until now useless) assignment:

  *changed = true;

Make things more obvious by actually making __vlan_add_flags() do what's written on the box, and be more specific about what is actually written on the box. This is needed because further transformations will be done to __vlan_add_flags().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Vladimir Oltean
When a VLAN is added to a bridge port and it doesn't exist on the bridge device yet, it gets created for the multicast context, but it is 'hidden', since it doesn't have the BRENTRY flag yet:

  ip link add br0 type bridge && ip link set swp0 master br0
  bridge vlan add dev swp0 vid 100     # the master VLAN 100 gets created
  bridge vlan add dev br0 vid 100 self # that VLAN becomes a brentry only now

All switchdev drivers ignore switchdev notifiers for VLAN entries which have the BRENTRY flag unset, and for good reason: these are merely private data structures used by the bridge driver. So we might just as well not notify those at all.

Cleanup in the switchdev drivers that check for the BRENTRY flag is now possible, and will be handled separately, since those checks just became dead code.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Vladimir Oltean
When a VLAN is added to a bridge port, a master VLAN gets created on the bridge for context, but it doesn't have the BRENTRY flag. Then, when the same VLAN is added to the bridge itself, it enters through the br_vlan_add_existing() code path and gains the BRENTRY flag, thus becoming "existing".

It seems natural to check for this condition early, because the current code flow is to notify switchdev of the addition of a VLAN that isn't a brentry, just to delete it immediately afterwards.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 15 Feb 2022, 1 commit
-
-
By Vladimir Oltean
If the following call path returns an error from switchdev:

  nbp_vlan_flush
  -> __vlan_del
     -> __vlan_vid_del
        -> br_switchdev_port_vlan_del

then the deletion of the net_bridge_vlan is silently halted, which will later trigger the WARN_ON from __vlan_group_free():

  WARN_ON(!list_empty(&vg->vlan_list));

The WARN_ON is rather unhelpful, because nothing about the source of the error is printed. Add a print to catch errors from __vlan_del.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 14 Feb 2022, 8 commits
-
-
By Ido Schimmel
Whenever rt6_uncached_list_flush_dev() swaps rt->rt6_idev to the blackhole device, parts of the IPv6 stack might still need to increment one SNMP counter.

Root cause, patch from Ido, changelog from Eric :)

This bug suggests that we need to audit rt->rt6_idev usages and make sure they are properly using RCU protection.

Fixes: e5f80fcf ("ipv6: give an IPv6 dev to blackhole_netdev")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Disabling interrupts and, in the RPS case, locking input_pkt_queue is split into local_irq_disable() and an optional spin_lock(). This breaks on PREEMPT_RT because the spinlock_t typed lock cannot be acquired with disabled interrupts.

The sections in which the lock is acquired are usually short, in the sense that they do not cause long and unbounded latencies. One exception is the skb_flow_limit() invocation, which may invoke a BPF program (and may require sleeping locks).

By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels. Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens as part of local_bh_disable() on the local CPU.

____napi_schedule() is only invoked if sd is from the local CPU. Replace it with __napi_schedule_irqoff(), which already disables interrupts on PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the function to napi_schedule_rps, as suggested by Jakub.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
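A rough sketch of the resulting helpers, with names taken from the changelog (kernel-context code; the real net/core/dev.c version may differ in detail):

    /* With RPS, take the queue lock; spin_lock_irq() disables interrupts
     * on !PREEMPT_RT, while on PREEMPT_RT the spinlock_t stays
     * preemptible and local_bh_disable() already synchronises the local
     * CPU. Without RPS, only !PREEMPT_RT needs the IRQ-off section.
     */
    static inline void rps_lock_irq_disable(struct softnet_data *sd)
    {
            if (IS_ENABLED(CONFIG_RPS))
                    spin_lock_irq(&sd->input_pkt_queue.lock);
            else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_irq_disable();
    }

    static inline void rps_unlock_irq_enable(struct softnet_data *sd)
    {
            if (IS_ENABLED(CONFIG_RPS))
                    spin_unlock_irq(&sd->input_pkt_queue.lock);
            else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_irq_enable();
    }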
-
Dave suggested a while ago (eleven years by now) "Let's make netif_rx() work in all contexts and get rid of netif_rx_ni()". Eric agreed and pointed out that modern devices should use netif_receive_skb() to avoid the overhead. In the meantime someone added another variant, netif_rx_any_context(), which behaves as suggested.

netif_rx() must be invoked with disabled bottom halves to ensure that pending softirqs, which were raised within the function, are handled. netif_rx_ni() can be invoked only from process context (bottom halves must be enabled) because the function handles pending softirqs without checking whether bottom halves were disabled or not. netif_rx_any_context() invokes one of the former functions depending on in_interrupt().

netif_rx() could be taught to handle both cases (disabled and enabled bottom halves) by simply disabling bottom halves while invoking netif_rx_internal(). The local_bh_enable() invocation will then invoke pending softirqs only if the BH-disable counter drops to zero.

Eric is concerned about the overhead of BH-disable+enable, especially in regard to the loopback driver. As critical as this driver is, it will receive a shortcut to avoid the additional overhead, which is not needed there.

Add a local_bh_disable() section in netif_rx() to ensure softirqs are handled if needed. Provide __netif_rx(), which does not disable BH and has a lockdep assert to ensure that interrupts are disabled. Use this shortcut in the loopback driver and in drivers/net/*.c. Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they can be removed once there are no users left.

Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
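A condensed sketch of the described shape of netif_rx() and __netif_rx() (illustrative kernel-context code, not a verbatim copy of net/core/dev.c):

    int netif_rx(struct sk_buff *skb)
    {
            int ret;

            local_bh_disable();             /* now safe from any context */
            ret = netif_rx_internal(skb);
            /* runs pending softirqs if the BH-disable count drops to 0 */
            local_bh_enable();
            return ret;
    }

    /* Shortcut for callers that already run with interrupts disabled,
     * e.g. the loopback driver, avoiding the disable/enable overhead.
     */
    int __netif_rx(struct sk_buff *skb)
    {
            lockdep_assert_irqs_disabled();
            return netif_rx_internal(skb);
    }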
-
The preempt_disable() section was introduced in commit cece1945 ("net: disable preemption before call smp_processor_id()") in case this function is invoked from preemptible context, and because get_cpu() was added later on.

The get_cpu() usage was added in commit b0e28f1e ("net: netif_rx() must disable preemption") because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption, causing a warning in smp_processor_id(). The function netif_rx() should only be invoked from an interrupt context, which implies disabled preemption. Commit e30b38c2 ("ip: Fix ip_dev_loopback_xmit()") addressed this and replaced netif_rx() with netif_rx_ni() in ip_dev_loopback_xmit().

Based on the discussion on the list, the former patch (b0e28f1e) should not have been applied, only the latter (e30b38c2).

Remove get_cpu() and preempt_disable() since the function is supposed to be invoked from context with stable per-CPU pointers. Bottom halves have to be disabled at this point because the function may raise softirqs which need to be processed.

Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Ahern
Add drop reasons to __udp6_lib_rcv for skb drops. The only twist is that NO_SOCKET takes precedence over the CSUM and other counters for that path (the motivation behind this patch: the csum counter was misleading).

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
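The precedence rule might look roughly like this (a simplified excerpt-style sketch, not the exact __udp6_lib_rcv() flow):

    enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;

    /* no socket matched: record NO_SOCKET first ... */
    reason = SKB_DROP_REASON_NO_SOCKET;
    if (udp_lib_checksum_complete(skb))
            goto csum_error;        /* csum counters still get bumped */
    /* ... port unreachable handling, then kfree_skb_reason(skb, reason) */

    csum_error:
            /* ... but an already-set NO_SOCKET wins as the drop reason */
            if (reason == SKB_DROP_REASON_NOT_SPECIFIED)
                    reason = SKB_DROP_REASON_UDP_CSUM;
            kfree_skb_reason(skb, reason);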
-
By Tony Lu
The previous patch introduces a lock-free version of smc_tx_work(), to avoid unnecessary lock contention, which is expected to be called with the lock already held. So add a comment to remind people to keep an eye out for locks.

Suggested-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Kalash Nainwal
Generate an RTM_NEWROUTE netlink notification when the route preference changes on an existing kernel-generated default route, in response to RA messages. Currently netlink notifications are generated only when this route is added or deleted, but not when the route preference changes, which can cause the state of userspace routing applications to go out of sync with the kernel.

Signed-off-by: Kalash Nainwal <kalash@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Davide Caratti
In current Linux, MTU policing does not take into account that packets at the TC ingress have the L2 header pulled. Thus, the same TC police action (with the same value of tcfp_mtu) behaves differently for ingress and egress. In addition, the full GSO size is compared to tcfp_mtu: as a consequence, the policer drops GSO packets even when the individual segments have an L2 + L3 + L4 + payload length below the configured value of tcfp_mtu.

Improve the accuracy of MTU policing as follows:
 - account for mac_len for non-GSO packets at TC ingress;
 - compare the MTU threshold with the segmented size for GSO packets.

Also, add a kselftest that verifies the correct behavior.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
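The resulting check is roughly as follows (a sketch; skb_gso_validate_mac_len() is the existing helper that validates each GSO segment, L2 header included, against a given length):

    static bool police_mtu_ok(struct sk_buff *skb, u32 mtu)
    {
            u32 len;

            if (skb_is_gso(skb))
                    /* judge per-segment size, not the full GSO size */
                    return skb_gso_validate_mac_len(skb, mtu);

            len = qdisc_pkt_len(skb);
            if (skb_at_tc_ingress(skb))
                    len += skb->mac_len;    /* L2 header already pulled */

            return len <= mtu;
    }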
-
- 11 Feb 2022, 13 commits
-
-
By Eric Dumazet
This is an optimization to keep the per-cpu lists as short as possible: whenever rt_flush_dev() changes one rtable whose dst.dev matches the disappearing device, it can transfer the object to a quarantine list, waiting for a final rt_del_uncached_list().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Eric Dumazet
This is an optimization to keep the per-cpu lists as short as possible: whenever rt6_uncached_list_flush_dev() changes one rt6_info matching the disappearing device, it can transfer the object to a quarantine list, waiting for a final rt6_uncached_list_del().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
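Conceptually, for both this patch and its IPv4 counterpart above, the change amounts to something like the following (illustrative names and layout, not the exact route.c code):

    struct uncached_list {
            spinlock_t       lock;
            struct list_head head;          /* live entries */
            struct list_head quarantine;    /* entries whose device died */
    };

    /* Park the entry on the quarantine list when its device disappears,
     * so later per-cpu flushes no longer have to walk it; the final
     * rt6_uncached_list_del() removes it from whichever list it is on.
     */
    static void rt6_move_to_quarantine(struct rt6_info *rt,
                                       struct uncached_list *ul)
    {
            spin_lock_bh(&ul->lock);
            list_move(&rt->rt6i_uncached, &ul->quarantine);
            spin_unlock_bh(&ul->lock);
    }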
-
By Eric Dumazet
IPv6 addrconf notifiers want the loopback device to be the last device being dismantled at netns deletion. This caused many limitations and workarounds.

Back in linux-5.3, Mahesh added a per-host blackhole_netdev that can be used whenever we need to make sure objects no longer refer to a disappearing device. If we attach an ip6_ptr to blackhole_netdev (allocate an idev), then we can use this special device (which is never freed) in place of the loopback_dev (which can be freed).

This will permit improvements in netdev_run_todo() and other parts of the stack that had steps to make sure loopback_dev was the last device to disappear.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Eric Dumazet
This counter has never been visible; there is little point trying to maintain it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Guillaume Nault
The ->rtm_tos option is normally used to route packets based on both the destination address and the DS field. However, it's ignored for IPv6 routes. Setting ->rtm_tos for IPv6 is thus invalid, as the route is going to work only on the destination address anyway, so it won't behave as specified.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Vladimir Oltean
Since commit 2f1e8ea7 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), suggested by Cong Wang, the DSA interfaces and their master have different dev->nested_level, which makes netif_addr_lock() stop complaining about potentially recursive locking on the same lock class. So we no longer need DSA slave interfaces to have their own lockdep class.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Vladimir Oltean
Since commit 2f1e8ea7 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), suggested by Cong Wang, the DSA interfaces and their master have different dev->nested_level, which makes netif_addr_lock() stop complaining about potentially recursive locking on the same lock class. So we no longer need DSA masters to have their own lockdep class.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Vladimir Oltean
There are no legacy ports anymore; DSA registers a devlink instance with ports unconditionally for all switch drivers. Therefore, delete the old-style ndo operations used for determining bridge forwarding domains.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By D. Wythe
Although we can control the SMC handshake limitation through socket options, applications that need it must modify their code, which is quite troublesome for many existing applications. This patch makes the global default value of the SMC handshake limitation configurable through netlink, providing a way to put a constraint on the handshake without modifying any application code.

Suggested-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By D. Wythe
This patch adds dynamic control of the SMC handshake limitation for every SMC socket: in a production environment, it is possible for the same application to handle different service types, which may have different opinions on the SMC handshake limitation. This patch uses socket options to accomplish that; since we don't have a socket option level for SMC yet, it has to be implemented at the same time.

This patch does the following:
 - add a new socket option level: SOL_SMC;
 - add a new SMC socket option: SMC_LIMIT_HS;
 - provide a getter/setter for SMC socket options.

Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
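A hedged userspace sketch of the new knob; the SOL_SMC and SMC_LIMIT_HS values below are assumptions based on this patch's uapi additions, so verify them against your installed headers:

    #include <sys/socket.h>

    #ifndef SOL_SMC
    #define SOL_SMC      286        /* assumed value from this patch set */
    #endif
    #ifndef SMC_LIMIT_HS
    #define SMC_LIMIT_HS 1          /* assumed value from this patch set */
    #endif

    /* Toggle the per-socket SMC handshake limitation. */
    static int smc_set_limit_hs(int fd, int on)
    {
            return setsockopt(fd, SOL_SMC, SMC_LIMIT_HS, &on, sizeof(on));
    }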
-
By D. Wythe
This patch provides a mechanism to put a constraint on SMC connection attempts according to the pressure on the SMC handshake process. At present, frequent attempts cause incoming connections to be backlogged in the SMC handshake queue and raise the connection establishment time, which is quite unacceptable for applications based on short-lived connections.

There are two ways to implement this mechanism:

1. Apply the limitation after the TCP connection is established.
2. Apply the limitation before the TCP connection is established.

In the first way, we need to wait for and receive the CLC messages that the client will potentially send, and then actively reply with a decline message; in a sense this is itself a sort of SMC handshake, and it affects the connection establishment time on its way.

In the second way, the only problem is that we need to inject SMC logic into TCP when it is about to reply to the incoming SYN. Since we already do that, it no longer seems to be a problem, and the advantage is obvious: few additional steps are required to enforce the constraint. This patch uses the second way.

After this patch, connections beyond the constraint are not given any SMC indication, and SMC is not involved in any of their subsequent processing.

Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By D. Wythe
The current implementation does not handle backlog semantics; one potential risk is that the server may be flooded by an infinite number of connections, even if the client is SMC-incapable. This patch puts a limit on backlog connections. Referring to the TCP implementation, we divide SMC connections into two categories:

1. Half SMC connections: all connections with TCP established but SMC not yet established.
2. Full SMC connections: all connections with SMC established.

For half SMC connections, since they all start with TCP established, we can achieve our goal by applying a limit before TCP is established. Following the TCP implementation, this limit is based not only on the half SMC connections but also on the full ones, so it is also a constraint on full SMC connections.

For full SMC connections, although we know exactly where they start, it's quite hard to apply a limit before that point. The easiest way is to block waiting before receiving the SMC confirm CLC message, but that runs under the protection of smc_server_lgr_pending, a global lock, which would extend the limit to the entire host instead of a single listen socket. Another way is to drop full connections, but considering the cost of SMC connections, we prefer to keep them. Even so, limits on full SMC connections still exist; see the discussion of half SMC connections above.

After this patch, the backlog limits look like this:

For SMC:
1. A client with SMC capability can make at most 2 * backlog full SMC connections, or 1 * backlog half SMC connections plus 1 * backlog full SMC connections.
2. A client without SMC capability can only make 1 * backlog half TCP connections plus 1 * backlog full TCP connections.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By D. Wythe
In a multithreaded benchmark with 10K connections, the backend TCP connections are established very slowly, and lots of TCP connections stay in SYN_SENT state.

Client:
  smc_run wrk -c 10000 -t 4 http://server

netstat on the server host shows:
  145042 times the listen queue of a socket overflowed
  145042 SYNs to LISTEN sockets dropped

One reason for this issue is that smc_tcp_listen_work() shares the same workqueue (smc_hs_wq) with smc_listen_work(), while smc_listen_work() does a blocking wait for the SMC connection to be established. Once the workqueue becomes congested, it blocks accept() on the TCP listen socket.

This patch creates an independent workqueue (smc_tcp_ls_wq) for smc_tcp_listen_work(), separating it from smc_listen_work(). This is quite acceptable considering that smc_tcp_listen_work() runs very fast.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
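A minimal sketch of the workqueue split (simplified; the real allocation flags in net/smc/af_smc.c may differ):

    static struct workqueue_struct *smc_hs_wq;      /* SMC handshake work */
    static struct workqueue_struct *smc_tcp_ls_wq;  /* TCP listen work */

    static int smc_wq_init(void)
    {
            smc_hs_wq = alloc_workqueue("smc_hs_wq", 0, 0);
            if (!smc_hs_wq)
                    return -ENOMEM;

            /* smc_tcp_listen_work() is fast and never blocks on the SMC
             * handshake, so giving it its own queue keeps accept() on
             * the TCP listen socket flowing even when smc_hs_wq is
             * congested.
             */
            smc_tcp_ls_wq = alloc_workqueue("smc_tcp_ls_wq", 0, 0);
            if (!smc_tcp_ls_wq) {
                    destroy_workqueue(smc_hs_wq);
                    return -ENOMEM;
            }
            return 0;
    }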
-
- 10 Feb 2022, 13 commits
-
-
By Minghao Chi (CGEL ZTE)
Replace the zero-length array with a flexible-array member and make use of the struct_size() helper in kmalloc(). For example:

  struct switchdev_deferred_item {
          ...
          unsigned long data[];
  };

Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
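The conversion pattern looks roughly like this (sketch; n is the element count of the trailing array, and the allocation flag is illustrative):

    static struct switchdev_deferred_item *dfitem_alloc(size_t n)
    {
            struct switchdev_deferred_item *dfitem;

            /* struct_size() computes
             * sizeof(*dfitem) + n * sizeof(dfitem->data[0])
             * with overflow checking, replacing the open-coded arithmetic.
             */
            dfitem = kmalloc(struct_size(dfitem, data, n), GFP_ATOMIC);
            return dfitem;
    }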
-
By Guillaume Nault
Commit 563f8e97 ("ipv4: Stop taking ECN bits into account in fib4-rules") replaced the validation test on frh->tos. While the new test is stricter for ECN bits, it doesn't detect the use of high order DSCP bits. This would be fine if IPv4 could properly handle them. But currently, most IPv4 lookups are done with the three high DSCP bits masked. Therefore, using these bits doesn't lead to the expected result.

Let's reject such configurations again, so that nobody starts to use and make any assumption about how the stack handles the three high order DSCP bits in fib4 rules.

Fixes: 563f8e97 ("ipv4: Stop taking ECN bits into account in fib4-rules")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Eric Dumazet
Having to acquire rtnl from netdev_run_todo() for every dismantled device is not desirable when/if rtnl is under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Victor Erminpour
When building with automatic stack variable initialization, GCC 12 complains about variables defined outside of switch case statements. Move the variable outside the switch, which silences the warning:

  ./net/mpls/af_mpls.c:1624:21: error: statement will never be executed [-Werror=switch-unreachable]
   1624 |         int err;
        |             ^~~

Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Tom Rix
Remove the second 'to'.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jakub Kicinski
Minor reordering of the code and a call to sock_cmsg_send() gives us support for setting the common socket options via cmsg (the usual ones: SO_MARK, SO_TIMESTAMPING_OLD, SCM_TXTIME).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
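For example, a userspace sketch of attaching SO_MARK as ancillary data to a single send on a ping socket (illustrative; setting a mark still requires the usual privileges):

    #include <string.h>
    #include <sys/socket.h>

    static ssize_t send_with_mark(int fd, struct msghdr *msg,
                                  unsigned int mark)
    {
            char cbuf[CMSG_SPACE(sizeof(mark))];
            struct cmsghdr *cmsg;

            memset(cbuf, 0, sizeof(cbuf));
            msg->msg_control = cbuf;
            msg->msg_controllen = sizeof(cbuf);

            cmsg = CMSG_FIRSTHDR(msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SO_MARK;      /* per-call socket mark */
            cmsg->cmsg_len = CMSG_LEN(sizeof(mark));
            memcpy(CMSG_DATA(cmsg), &mark, sizeof(mark));

            return sendmsg(fd, msg, 0);
    }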
-
By Jakub Kicinski
Nothing prevents the user from requesting timestamping on ping6 sockets, yet timestamps are not going to be reported. Plumb the flags through.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jakub Kicinski
We have ftrace and BPF today; there's no need for printing arguments at the start of a function.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jon Maloy
The function tipc_mon_rcv() allows a node to receive and process domain_record structs from peer nodes to track their views of the network topology. This patch verifies that the number of members in a received domain record does not exceed the limit defined by MAX_MON_DOMAIN, something that may otherwise lead to a stack overflow.

tipc_mon_rcv() is called from the function tipc_link_proto_rcv(), where we are reading a 32 bit message data length field into a uint16. To avert any risk of bit overflow, we add an extra sanity check for this in that function. We cannot see that happen with the current code, but future designers being unaware of this risk may introduce it by allowing delivery of very large (> 64k) sk buffers from the bearer layer. This potential problem was identified by Eric Dumazet.

This fixes CVE-2022-0435.

Reported-by: Samuel Page <samuel.page@appgate.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 35c55c98 ("tipc: add neighbor monitoring framework")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Samuel Page <samuel.page@appgate.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
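A sketch of the added bounds check (simplified; struct tipc_mon_domain and dom_rec_len() are from net/tipc/monitor.c, and the real tipc_mon_rcv() performs these checks inline):

    /* Reject domain records whose member count exceeds MAX_MON_DOMAIN,
     * and records whose length disagrees with that count.
     */
    static bool dom_rec_sane(struct tipc_mon_domain *dom, u16 dlen)
    {
            u16 member_cnt = ntohs(dom->member_cnt);

            if (member_cnt > MAX_MON_DOMAIN)
                    return false;
            if (dlen < dom_rec_len(dom, 0) ||
                dlen != dom_rec_len(dom, member_cnt))
                    return false;
            return true;
    }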
-
By Kishen Maloor
This change updates mptcp_pm_nl_create_listen_socket() to create listening sockets bound to IPv6 addresses (where IPv6 is supported).

Fixes: 1729cf18 ("mptcp: create the listening socket for new port")
Acked-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Kishen Maloor <kishen.maloor@intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Commit 9652dc2e ("tcp: relax listening_hash operations") removed the need to disable bottom halves while acquiring listening_hash.lock. There are still two callers left which disable bottom halves before the lock is acquired.

On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources protected by disabling bottom halves remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves is also acquired with enabled bottom halves, followed by disabling bottom halves. This is the reverse locking order. It has been observed with inet_listen_hashbucket::lock:

  local_bh_disable() + spin_lock(&ilb->lock):
    inet_listen()
      inet_csk_listen_start()
        sk->sk_prot->hash() := inet_hash()
          local_bh_disable()
          __inet_hash()
            spin_lock(&ilb->lock);
              acquire(&ilb->lock);

  Reverse order: spin_lock(&ilb2->lock) + local_bh_disable():
    tcp_seq_next()
      listening_get_next()
        spin_lock(&ilb2->lock);
          acquire(&ilb2->lock);

    tcp4_seq_show()
      get_tcp4_sock()
        sock_i_ino()
          read_lock_bh(&sk->sk_callback_lock);
            acquire(softirq_ctrl) // <---- whoops
            acquire(&sk->sk_callback_lock)

Drop local_bh_disable() around __inet_hash(), which acquires listening_hash->lock. Split inet_unhash() and acquire the listen_hashbucket lock without disabling bottom halves, and the inet_ehash lock with bottom halves disabled.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de
Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Menglong Dong
In commit c504e5c2 ("net: skb: introduce kfree_skb_reason()"), a drop reason was introduced to the tracepoint of kfree_skb. Therefore, drop_monitor is able to report the drop reason to users via netlink. The drop reasons are reported as strings to users, exactly as when reporting them to ftrace.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220209060838.55513-1-imagedong@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Jakub Sitnicki
remote_port is another case of a BPF context field documented as a 32-bit value in network byte order for which the BPF context access converter generates a load of a zero-padded 16-bit integer in network byte order. The first such case was dst_port in bpf_sock, which got addressed in commit 4421a582 ("bpf: Make dst_port field in struct bpf_sock 16-bit wide").

Loading 4 bytes from the remote_port offset and converting the value with bpf_ntohl() leads to surprising results, as the expected value is shifted by 16 bits. Reduce the confusion by splitting the field in two: a 16-bit field holding a big-endian integer, and a 16-bit zero-padding anonymous field that follows it.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com
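Schematically, the uapi change and the natural BPF-side conversion after it (a sketch; see the patch for the full struct bpf_sk_lookup layout):

    /* before: declared __u32 in network byte order, but the context
     * converter loaded a zero-padded 16-bit value, so
     * bpf_ntohl(ctx->remote_port) came out shifted by 16 bits
     */
    struct bpf_sk_lookup {
            /* ... */
            __be16 remote_port;     /* big-endian 16-bit port */
            __u16  :16;             /* anonymous zero padding */
            /* ... */
    };

    /* in a BPF program the obvious conversion now does what it says: */
    __u16 port = bpf_ntohs(ctx->remote_port);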
-
- 09 Feb 2022, 1 commit
-
-
By Xin Long
Shuang Li reported a QinQ issue by simply doing:

  # ip link add dummy0 type dummy
  # ip link add link dummy0 name dummy0.1 type vlan id 1
  # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
  # rmmod 8021q

  unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1

When the 8021q module is removed, all vlan devices are deleted from their real_dev's vlan grp and added into list_kill by unregister_vlan_dev(). dummy0.1 is unregistered before dummy0.1.2, as __rtnl_kill_links() uses for_each_netdev(). When dummy0.1 is unregistered, dummy0.1.2 is not unregistered in the NETDEV_UNREGISTER event, as it has already been deleted from dummy0.1's vlan grp. However, since dummy0.1.2 still holds dummy0.1, dummy0.1 will keep waiting in netdev_wait_allrefs(), while dummy0.1.2 will never get unregistered and never release dummy0.1, as dev_put is delayed until dev->priv_destructor, vlan_dev_free(), is called.

This issue was introduced by commit 563bcbae ("net: vlan: fix a UAF in vlan_dev_real_dev()"), and this patch fixes it by moving dev_put() into vlan_dev_uninit(), which is called after the NETDEV_UNREGISTER event but before netdev_wait_allrefs().

Fixes: 563bcbae ("net: vlan: fix a UAF in vlan_dev_real_dev()")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-