提交 · 18596f504a3e56c4f8e132b2a437cbe23a3f4635 · openeuler / Kernel

12 2月, 2021 5 次提交

net: dsa: add support for offloading HSR · 18596f50

由 George McCollister 提交于 2月 09, 2021

Add support for offloading of HSR/PRP (IEC 62439-3) tag insertion
tag removal, duplicate generation and forwarding on DSA switches.

Add DSA_NOTIFIER_HSR_JOIN and DSA_NOTIFIER_HSR_LEAVE which trigger calls
to .port_hsr_join and .port_hsr_leave in the DSA driver for the switch.

The DSA switch driver should then set netdev feature flags for the
HSR/PRP operation that it offloads.
    NETIF_F_HW_HSR_TAG_INS
    NETIF_F_HW_HSR_TAG_RM
    NETIF_F_HW_HSR_FWD
    NETIF_F_HW_HSR_DUP
Signed-off-by: NGeorge McCollister <george.mccollister@gmail.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

18596f50

net: hsr: add offloading support · dcf0cd1c

由 George McCollister 提交于 2月 09, 2021

Add support for offloading of HSR/PRP (IEC 62439-3) tag insertion
tag removal, duplicate generation and forwarding.

For HSR, insertion involves the switch adding a 6 byte HSR header after
the 14 byte Ethernet header. For PRP it adds a 6 byte trailer.

Tag removal involves automatically stripping the HSR/PRP header/trailer
in the switch. This is possible when the switch also performs auto
deduplication using the HSR/PRP header/trailer (making it no longer
required).

Forwarding involves automatically forwarding between redundant ports in
an HSR. This is crucial because delay is accumulated as a frame passes
through each node in the ring.

Duplication involves the switch automatically sending a single frame
from the CPU port to both redundant ports. This is required because the
inserted HSR/PRP header/trailer must contain the same sequence number
on the frames sent out both redundant ports.

Export is_hsr_master so DSA can tell them apart from other devices in
dsa_slave_changeupper.
Signed-off-by: NGeorge McCollister <george.mccollister@gmail.com>
Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dcf0cd1c

net: hsr: generate supervision frame without HSR/PRP tag · 78be9217

由 George McCollister 提交于 2月 09, 2021

For a switch to offload insertion of HSR/PRP tags, frames must not be
sent to the CPU facing switch port with a tag. Generate supervision frames
(eth type ETH_P_PRP) without HSR v1 (ETH_P_HSR)/PRP tag and rely on
create_tagged_frame which inserts it later. This will allow skipping the
tag insertion for all outgoing frames in the future which is required for
HSR v1/PRP tag insertions to be offloaded.

HSR v0 supervision frames always contain tag information so insertion of
the tag can't be offloaded. IEC 62439-3 Ed.2.0 (HSR v1) specifically
notes that this was changed since v0 to allow offloading.
Signed-off-by: NGeorge McCollister <george.mccollister@gmail.com>
Reviewed-by: NVladimir Oltean <olteanv@gmail.com>
Tested-by: NVladimir Oltean <olteanv@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

78be9217

tcp: add some entropy in __inet_hash_connect() · c579bd1b

由 Eric Dumazet 提交于 2月 09, 2021

Even when implementing RFC 6056 3.3.4 (Algorithm 4: Double-Hash
Port Selection Algorithm), a patient attacker could still be able
to collect enough state from an otherwise idle host.

Idea of this patch is to inject some noise, in the
cases __inet_hash_connect() found a candidate in the first
attempt.

This noise should not significantly reduce the collision
avoidance, and should be zero if connection table
is already well used.

Note that this is not implementing RFC 6056 3.3.5
because we think Algorithm 5 could hurt typical
workloads.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: David Dworken <ddworken@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c579bd1b

tcp: change source port randomizarion at connect() time · 190cc824

由 Eric Dumazet 提交于 2月 09, 2021

RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
provides good summary of why source selection needs extra care.

David Dworken reminded us that linux implements Algorithm 3
as described in RFC 6056 3.3.3

Quoting David :
   In the context of the web, this creates an interesting info leak where
   websites can count how many TCP connections a user's computer is
   establishing over time. For example, this allows a website to count
   exactly how many subresources a third party website loaded.
   This also allows:
   - Distinguishing between different users behind a VPN based on
       distinct source port ranges.
   - Tracking users over time across multiple networks.
   - Covert communication channels between different browsers/browser
       profiles running on the same computer
   - Tracking what applications are running on a computer based on
       the pattern of how fast source ports are getting incremented.

Section 3.3.4 describes an enhancement, that reduces
attackers ability to use the basic information currently
stored into the shared 'u32 hint'.

This change also decreases collision rate when
multiple applications need to connect() to
different destinations.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDavid Dworken <ddworken@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

190cc824

11 2月, 2021 1 次提交

rxrpc: Fix missing dependency on NET_UDP_TUNNEL · dc0e6056

由 David Howells 提交于 2月 09, 2021

The changes to make rxrpc create the udp socket missed a bit to add the
Kconfig dependency on the udp tunnel code to do this.

Fix this by adding making AF_RXRPC select NET_UDP_TUNNEL.

Fixes: 1a9b86c9 ("rxrpc: use udp tunnel APIs instead of open code in rxrpc_open_socket")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Reviewed-by: NXin Long <lucien.xin@gmail.com>
cc: alaa@dev.mellanox.co.il
cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dc0e6056

10 2月, 2021 4 次提交

vsock: fix locking in vsock_shutdown() · 1c5fae9c

由 Stefano Garzarella 提交于 2月 09, 2021

In vsock_shutdown() we touched some socket fields without holding the
socket lock, such as 'state' and 'sk_flags'.

Also, after the introduction of multi-transport, we are accessing
'vsk->transport' in vsock_send_shutdown() without holding the lock
and this call can be made while the connection is in progress, so
the transport can change in the meantime.

To avoid issues, we hold the socket lock when we enter in
vsock_shutdown() and release it when we leave.

Among the transports that implement the 'shutdown' callback, only
hyperv_transport acquired the lock. Since the caller now holds it,
we no longer take it.

Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c5fae9c

net: add sysfs attribute to control napi threaded mode · 5fdd2f0e

由 Wei Wang 提交于 2月 08, 2021

This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device,
without the need for a device up/down.
User sets it to 1 or 0 to enable or disable threaded mode.
Note: when switching between threaded and the current softirq based mode
for a napi instance, it will not immediately take effect if the napi is
currently being polled. The mode switch will happen for the next time
napi_schedule() is called.
Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Co-developed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: NFelix Fietkau <nbd@nbd.name>
Signed-off-by: NFelix Fietkau <nbd@nbd.name>
Signed-off-by: NWei Wang <weiwan@google.com>
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5fdd2f0e

net: implement threaded-able napi poll loop support · 29863d41

由 Wei Wang 提交于 2月 08, 2021

This patch allows running each napi poll loop inside its own
kernel thread.
The kthread is created during netif_napi_add() if dev->threaded
is set. And threaded mode is enabled in napi_enable(). We will
provide a way to set dev->threaded and enable threaded mode
without a device up/down in the following patch.

Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.

The threaded poll loop behaves quite likely the net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.
Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Co-developed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NWei Wang <weiwan@google.com>
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29863d41

net: extract napi poll functionality to __napi_poll() · 898f8015

由 Felix Fietkau 提交于 2月 08, 2021

This commit introduces a new function __napi_poll() which does the main
logic of the existing napi_poll() function, and will be called by other
functions in later commits.
This idea and implementation is done by Felix Fietkau <nbd@nbd.name> and
is proposed as part of the patch to move napi work to work_queue
context.
This commit by itself is a code restructure.
Signed-off-by: NFelix Fietkau <nbd@nbd.name>
Signed-off-by: NWei Wang <weiwan@google.com>
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

898f8015

09 2月, 2021 13 次提交

IPv6: Extend 'fib_notify_on_flag_change' sysctl · 6fad361a

由 Amit Cohen 提交于 2月 07, 2021

Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.

Separate value is added for such notifications because there are less of
them, so they do not impact performance and some users will find them more
important.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6fad361a

IPv6: Add "offload failed" indication to routes · 0c5fcf9e

由 Amit Cohen 提交于 2月 07, 2021

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

To avoid such cases, previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed, this behavior is controlled by sysctl.

With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.

This patch adds an "offload_failed" indication to IPv6 routes, so that
users will have better visibility into the offload process.

'struct fib6_info' is extended with new field that indicates if route
offload failed. Note that the new field is added using unused bit and
therefore there is no need to increase struct size.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c5fcf9e

IPv4: Extend 'fib_notify_on_flag_change' sysctl · 648106c3

由 Amit Cohen 提交于 2月 07, 2021

Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.

Separate value is added for such notifications because there are less of
them, so they do not impact performance and some users will find them more
important.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

648106c3

IPv4: Add "offload failed" indication to routes · 36c5100e

由 Amit Cohen 提交于 2月 07, 2021

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

To avoid such cases, previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed, this behavior is controlled by sysctl.

With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.

This patch adds an "offload_failed" indication to IPv4 routes, so that
users will have better visibility into the offload process.

'struct fib_alias', and 'struct fib_rt_info' are extended with new field
that indicates if route offload failed. Note that the new field is added
using unused bit and therefore there is no need to increase structs size.
Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

36c5100e

bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state · b2bdba1c

由 Horatiu Vultur 提交于 2月 06, 2021

The function br_mrp_port_switchdev_set_state was called both with MRP
port state and STP port state, which is an issue because they don't
match exactly.

Therefore, update the function to be used only with STP port state and
use the id SWITCHDEV_ATTR_ID_PORT_STP_STATE.

The choice of using STP over MRP is that the drivers already implement
SWITCHDEV_ATTR_ID_PORT_STP_STATE and already in SW we update the port
STP state.

Fixes: 9a9f26e8 ("bridge: mrp: Connect MRP API with the switchdev API")
Fixes: fadd4091 ("bridge: switchdev: mrp: Implement MRP API for switchdev")
Fixes: 2f1a11ae ("bridge: mrp: Add MRP interface.")
Reported-by: NRasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: NHoratiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2bdba1c

netfilter: nftables: relax check for stateful expressions in set definition · 664899e8

由 Pablo Neira Ayuso 提交于 2月 08, 2021

Restore the original behaviour where users are allowed to add an element
with any stateful expression if the set definition specifies no stateful
expressions. Make sure upper maximum number of stateful expressions of
NFT_SET_EXPR_MAX is not reached.

Fixes: 8cfd9b0f ("netfilter: nftables: generalize set expressions support")
Fixes: 48b0ae04 ("netfilter: nftables: netlink support for several set element expressions")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

664899e8

net: bridge: use switchdev for port flags set through sysfs too · 8043c845

由 Vladimir Oltean 提交于 2月 07, 2021

Looking through patchwork I don't see that there was any consensus to
use switchdev notifiers only in case of netlink provided port flags but
not sysfs (as a sort of deprecation, punishment or anything like that),
so we should probably keep the user interface consistent in terms of
functionality.

http://patchwork.ozlabs.org/project/netdev/patch/20170605092043.3523-3-jiri@resnulli.us/
http://patchwork.ozlabs.org/project/netdev/patch/20170608064428.4785-3-jiri@resnulli.us/

Fixes: 3922285d ("net: bridge: Add support for offloading port attributes")
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8043c845

rxrpc: use udp tunnel APIs instead of open code in rxrpc_open_socket · 1a9b86c9

由 Xin Long 提交于 2月 07, 2021

In rxrpc_open_socket(), now it's using sock_create_kern() and
kernel_bind() to create a udp tunnel socket, and other kernel
APIs to set up it. These code can be replaced with udp tunnel
APIs udp_sock_create() and setup_udp_tunnel_sock(), and it'll
simplify rxrpc_open_socket().

Note that with this patch, the udp tunnel socket will always
bind to a random port if transport is not provided by users,
which is suggested by David Howells, thanks!
Acked-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Reviewed-by: NVadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a9b86c9

net-sysfs: Add rtnl locking for getting Tx queue traffic class · b2f17564

由 Alexander Duyck 提交于 2月 08, 2021

In order to access the suboordinate dev for a device we should be holding
the rtnl_lock when outside of the transmit path. The existing code was not
doing that for the sysfs dump function and as a result we were open to a
possible race.

To resolve that take the rtnl lock prior to accessing the sb_dev field of
the Tx queue and release it after we have retrieved the tc for the queue.
Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2f17564

netfilter: conntrack: skip identical origin tuple in same zone only · 07998281

由 Florian Westphal 提交于 2月 05, 2021

The origin skip check needs to re-test the zone. Else, we might skip
a colliding tuple in the reply direction.

This only occurs when using 'directional zones' where origin tuples
reside in different zones but the reply tuples share the same zone.

This causes the new conntrack entry to be dropped at confirmation time
because NAT clash resolution was elided.

Fixes: 4e35c1cb ("netfilter: nf_nat: skip nat clash resolution for same-origin entries")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

07998281

vsock/virtio: update credit only if socket is not closed · ce7536bc

由 Stefano Garzarella 提交于 2月 08, 2021

If the socket is closed or is being released, some resources used by
virtio_transport_space_update() such as 'vsk->trans' may be released.

To avoid a use after free bug we should only update the available credit
when we are sure the socket is still open and we have the lock held.

Fixes: 06a8fc78 ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210208144454.84438-1-sgarzare@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

ce7536bc

seg6: fool-proof the processing of SRv6 behavior attributes · 300a0fd8

由 Andrea Mayer 提交于 2月 06, 2021

The set of required attributes for a given SRv6 behavior is identified
using a bitmap stored in an unsigned long, since the initial design of SRv6
networking in Linux. Recently the same approach has been used for
identifying the optional attributes.

However, the number of attributes supported by SRv6 behaviors depends on
the size of the unsigned long type which changes with the architecture.
Indeed, on a 64-bit architecture, an SRv6 behavior can support up to 64
attributes while on a 32-bit architecture it can support at most 32
attributes.

To fool-proof the processing of SRv6 behaviors we verify, at compile time,
that the set of all supported SRv6 attributes can be encoded into a bitmap
stored in an unsigned long. Otherwise, kernel build fails forcing
developers to reconsider adding a new attribute or extend the total
number of supported attributes by the SRv6 behaviors.

Moreover, we replace all patterns (1 << i) with the macro SEG6_F_ATTR(i) in
order to address potential overflow issues caused by 32-bit signed
arithmetic.

Thanks to Colin Ian King for catching the overflow problem, providing a
solution and inspiring this patch.
Thanks to Jakub Kicinski for his useful suggestions during the design of
this patch.

v2:
- remove the SEG6_LOCAL_MAX_SUPP which is not strictly needed: it can
be derived from the unsigned long type. Thanks to David Ahern for
pointing it out.
Signed-off-by: NAndrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210206170934.5982-1-andrea.mayer@uniroma2.itSigned-off-by: NJakub Kicinski <kuba@kernel.org>

300a0fd8

net: fix iteration for sctp transport seq_files · af8085f3

由 NeilBrown 提交于 2月 05, 2021

The sctp transport seq_file iterators take a reference to the transport
in the ->start and ->next functions and releases the reference in the
->show function.  The preferred handling for such resources is to
release them in the subsequent ->next or ->stop function call.

Since Commit 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration
code and interface") there is no guarantee that ->show will be called
after ->next, so this function can now leak references.

So move the sctp_transport_put() call to ->next and ->stop.

Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
Reported-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

af8085f3

07 2月, 2021 13 次提交

net/vmw_vsock: improve locking in vsock_connect_timeout() · 3d0bc44d

由 Norbert Slusarek 提交于 2月 05, 2021

A possible locking issue in vsock_connect_timeout() was recognized by
Eric Dumazet which might cause a null pointer dereference in
vsock_transport_cancel_pkt(). This patch assures that
vsock_transport_cancel_pkt() will be called within the lock, so a race
condition won't occur which could result in vsk->transport to be set to NULL.

Fixes: 380feae0 ("vsock: cancel packets when failing to connect")
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NNorbert Slusarek <nslusarek@gmx.net>
Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/trinity-f8e0937a-cf0e-4d80-a76e-d9a958ba3ef1-1612535522360@3c-app-gmx-bap12Signed-off-by: NJakub Kicinski <kuba@kernel.org>

3d0bc44d

net/vmw_vsock: fix NULL pointer dereference · 5d1cbcc9

由 Norbert Slusarek 提交于 2月 05, 2021

In vsock_stream_connect(), a thread will enter schedule_timeout().
While being scheduled out, another thread can enter vsock_stream_connect()
as well and set vsk->transport to NULL. In case a signal was sent, the
first thread can leave schedule_timeout() and vsock_transport_cancel_pkt()
will be called right after. Inside vsock_transport_cancel_pkt(), a null
dereference will happen on transport->cancel_pkt.

Fixes: c0cfa2d8 ("vsock: add multi-transports support")
Signed-off-by: NNorbert Slusarek <nslusarek@gmx.net>
Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/trinity-c2d6cede-bfb1-44e2-85af-1fbc7f541715-1612535117028@3c-app-gmx-bap12Signed-off-by: NJakub Kicinski <kuba@kernel.org>

5d1cbcc9

net/packet: Improve the comment about LL header visibility criteria · 21c85974

由 Xie He 提交于 2月 05, 2021

The "dev_has_header" function, recently added in
commit d5496990 ("net/packet: fix packet receive on L3 devices
without visible hard header"),
is more accurate as criteria for determining whether a device exposes
the LL header to upper layers, because in addition to dev->header_ops,
it also checks for dev->header_ops->create.

When transmitting an skb on a device, dev_hard_header can be called to
generate an LL header. dev_hard_header will only generate a header if
dev->header_ops->create is present.
Signed-off-by: NXie He <xie.he.0141@gmail.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210205224124.21345-1-xie.he.0141@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

21c85974

net: dsa: make assisted_learning_on_cpu_port bypass offloaded LAG interfaces · a324d3d4

由 Vladimir Oltean 提交于 2月 06, 2021

Given the following topology, and focusing only on Box A:

         Box A
         +----------------------------------+
         | Board 1         br0              |
         |             +---------+          |
         |            /           \         |
         |            |           |         |
         |            |         bond0       |
         |            |        +-----+      |
         |192.168.1.1 |       /       \     |
         |  eno0     swp0    swp1    swp2   |
         +---|--------|-------|-------|-----+
             |        |       |       |
             +--------+       |       |
               Cable          |       |
                         Cable|       |Cable
               Cable          |       |
             +--------+       |       |
             |        |       |       |
         +---|--------|-------|-------|-----+
         |  eno0     swp0    swp1    swp2   |
         |192.168.1.2 |       \       /     |
         |            |        +-----+      |
         |            |         bond0       |
         |            |           |         |
         |            \           /         |
         |             +---------+          |
         | Board 2         br0              |
         +----------------------------------+
         Box B

The assisted_learning_on_cpu_port logic will see that swp0 is bridged
with a "foreign interface" (bond0) and will therefore install all
addresses learnt by the software bridge towards bond0 (including the
address of eno0 on Box B) as static addresses towards the CPU port.

But that's not what we want - bond0 is not really a "foreign interface"
but one we can offload including L2 forwarding from/towards it. So we
need to refine our logic for assisted learning such that, whenever we
see an address learnt on a non-DSA interface, we search through the tree
for any port that offloads that non-DSA interface.

Some confusion might arise as to why we search through the whole tree
instead of just the local switch returned by dsa_slave_dev_lower_find.
Or a different angle of the same confusion: why does
dsa_slave_dev_lower_find(br_dev) return a single dp that's under br_dev
instead of the whole list of bridged DSA ports?

To answer the second question, it should be enough to install the static
FDB entry on the CPU port of a single switch in the tree, because
dsa_port_fdb_add uses DSA_NOTIFIER_FDB_ADD which ensures that all other
switches in the tree get notified of that address, and add the entry
themselves using dsa_towards_port().

This should help understand the answer to the first question: the port
returned by dsa_slave_dev_lower_find may not be on the same switch as
the ports that offload the LAG. Nonetheless, if the driver implements
.crosschip_lag_join and .crosschip_bridge_join as mv88e6xxx does, there
still isn't any reason for trapping addresses learnt on the remote LAG
towards the CPU, and we should prevent that.
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

a324d3d4

Revert "net: ipv4: handle DSA enabled master network devices" · 46acf7bd

由 Vladimir Oltean 提交于 2月 05, 2021

This reverts commit 728c0208.

Since 2015 DSA has gained more integration with the network stack, we
can now have the same functionality without explicitly open-coding for
it:
- It now opens the DSA master netdevice automatically whenever a user
  netdevice is opened.
- The master and switch interfaces are coupled in an upper/lower
  hierarchy using the netdev adjacency lists.

In the nfsroot example below, the interface chosen by autoconfig was
swp3, and every interface except that and the DSA master, eth1, was
brought down afterwards:

[    8.714215] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[    8.978041] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[    9.246134] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[    9.486203] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[    9.512827] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[    9.521047] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control off
[    9.530382] device eth1 entered promiscuous mode
[    9.535452] DSA: tree 0 setup
[    9.539777] printk: console [netcon0] enabled
[    9.544504] netconsole: network logging started
[    9.555047] fsl_enetc 0000:00:00.2 eth1: configuring for fixed/internal link mode
[    9.562790] fsl_enetc 0000:00:00.2 eth1: Link is Up - 1Gbps/Full - flow control off
[    9.564661] 8021q: adding VLAN 0 to HW filter on device bond0
[    9.637681] fsl_enetc 0000:00:00.0 eth0: PHY [0000:00:00.0:02] driver [Qualcomm Atheros AR8031/AR8033] (irq=POLL)
[    9.655679] fsl_enetc 0000:00:00.0 eth0: configuring for inband/sgmii link mode
[    9.666611] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[    9.676216] 8021q: adding VLAN 0 to HW filter on device swp0
[    9.682086] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[    9.690700] 8021q: adding VLAN 0 to HW filter on device swp1
[    9.696538] mscc_felix 0000:00:00.5 swp2: configuring for inband/qsgmii link mode
[    9.705131] 8021q: adding VLAN 0 to HW filter on device swp2
[    9.710964] mscc_felix 0000:00:00.5 swp3: configuring for inband/qsgmii link mode
[    9.719548] 8021q: adding VLAN 0 to HW filter on device swp3
[    9.747811] Sending DHCP requests ..
[   12.742899] mscc_felix 0000:00:00.5 swp1: Link is Up - 1Gbps/Full - flow control rx/tx
[   12.743828] mscc_felix 0000:00:00.5 swp0: Link is Up - 1Gbps/Full - flow control off
[   12.747062] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
[   12.755216] fsl_enetc 0000:00:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   12.766603] IPv6: ADDRCONF(NETDEV_CHANGE): swp0: link becomes ready
[   12.783188] mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control rx/tx
[   12.785354] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   12.799535] IPv6: ADDRCONF(NETDEV_CHANGE): swp2: link becomes ready
[   13.803141] mscc_felix 0000:00:00.5 swp3: Link is Up - 1Gbps/Full - flow control rx/tx
[   13.811646] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
[   15.452018] ., OK
[   15.470336] IP-Config: Got DHCP answer from 10.0.0.1, my address is 10.0.0.39
[   15.477887] IP-Config: Complete:
[   15.481330]      device=swp3, hwaddr=00:04:9f:05:de:0a, ipaddr=10.0.0.39, mask=255.255.255.0, gw=10.0.0.1
[   15.491846]      host=10.0.0.39, domain=(none), nis-domain=(none)
[   15.498429]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
[   15.498481]      nameserver0=8.8.8.8
[   15.627542] fsl_enetc 0000:00:00.0 eth0: Link is Down
[   15.690903] mscc_felix 0000:00:00.5 swp0: Link is Down
[   15.745216] mscc_felix 0000:00:00.5 swp1: Link is Down
[   15.800498] mscc_felix 0000:00:00.5 swp2: Link is Down
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

46acf7bd

Revert "net: Have netpoll bring-up DSA management interface" · ea92000d

由 Vladimir Oltean 提交于 2月 05, 2021

This reverts commit 1532b977.

The above commit is good and it works, however it was meant as a bugfix
for stable kernels and now we have more self-contained ways in DSA to
handle the situation where the DSA master must be brought up.
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ea92000d

net: dsa: automatically bring user ports down when master goes down · c0a8a9c2

由 Vladimir Oltean 提交于 2月 05, 2021

This is not fixing any actual bug that I know of, but having a DSA
interface that is up even when its lower (master) interface is down is
one of those things that just do not sound right.

Yes, DSA checks if the master is up before actually bringing the
user interface up, but nobody prevents bringing the master interface
down immediately afterwards... Then the user ports would attempt
dev_queue_xmit on an interface that is down, and wonder what's wrong.

This patch prevents that from happening. NETDEV_GOING_DOWN is the
notification emitted _before_ the master actually goes down, and we are
protected by the rtnl_mutex, so all is well.

For those of you reading this because you were doing switch testing
such as latency measurements for autonomously forwarded traffic, and you
needed a controlled environment with no extra packets sent by the
network stack, this patch breaks that, because now the user ports go
down too, which may shut down the PHY etc. But please don't do it like
that, just do instead:

tc qdisc add dev eno2 clsact
tc filter add dev eno2 egress flower action drop

Tested with two cascaded DSA switches:
$ ip link set eno2 down
sja1105 spi2.0 sw0p2: Link is Down
mscc_felix 0000:00:00.5 swp0: Link is Down
fsl_enetc 0000:00:00.2 eno2: Link is Down
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

c0a8a9c2

net: dsa: automatically bring up DSA master when opening user port · 9d5ef190

由 Vladimir Oltean 提交于 2月 05, 2021

DSA wants the master interface to be open before the user port is due to
historical reasons. The promiscuity of interfaces that are down used to
have issues, as referenced Lennert Buytenhek in commit df02c6ff
("dsa: fix master interface allmulti/promisc handling").

The bugfix mentioned there, commit b6c40d68 ("net: only invoke
dev->change_rx_flags when device is UP"), was basically a "don't do
that" approach to working around the promiscuity while down issue.

Further work done by Vlad Yasevich in commit d2615bf4 ("net: core:
Always propagate flag changes to interfaces") has resolved the
underlying issue, and it is strictly up to the DSA and 8021q drivers
now, it is no longer mandated by the networking core that the master
interface must be up when changing its promiscuity.

From DSA's point of view, deciding to error out in dsa_slave_open
because the master isn't up is
(a) a bad user experience and
(b) knocking at an open door.
Even if there still was an issue with promiscuity while down, DSA could
still just open the master and avoid it.

Doing it this way has the additional benefit that user space can now
remove DSA-specific workarounds, like systemd-networkd with BindCarrier:
https://github.com/systemd/systemd/issues/7478

And we can finally remove one of the 2 bullets in the "Common pitfalls
using DSA setups" chapter.

Tested with two cascaded DSA switches:

$ ip link set sw0p2 up
fsl_enetc 0000:00:00.2 eno2: configuring for fixed/internal link mode
fsl_enetc 0000:00:00.2 eno2: Link is Up - 1Gbps/Full - flow control rx/tx
mscc_felix 0000:00:00.5 swp0: configuring for fixed/sgmii link mode
mscc_felix 0000:00:00.5 swp0: Link is Up - 1Gbps/Full - flow control off
8021q: adding VLAN 0 to HW filter on device swp0
sja1105 spi2.0 sw0p2: configuring for phy/rgmii-id link mode
IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): swp0: link becomes ready
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

9d5ef190

mptcp: pm: add lockdep assertions · 3abc05d9

由 Florian Westphal 提交于 2月 04, 2021

Add a few assertions to make sure functions are called with the needed
locks held.
Two functions gain might_sleep annotations because they contain
conditional calls to functions that sleep.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

3abc05d9

net: Introduce {netdev,napi}_alloc_frag_align() · 3f6e687d

由 Kevin Hao 提交于 2月 04, 2021

In the current implementation of {netdev,napi}_alloc_frag(), it doesn't
have any align guarantee for the returned buffer address, But for some
hardwares they do require the DMA buffer to be aligned correctly,
so we would have to use some workarounds like below if the buffers
allocated by the {netdev,napi}_alloc_frag() are used by these hardwares
for DMA.
    buf = napi_alloc_frag(really_needed_size + align);
    buf = PTR_ALIGN(buf, align);

These codes seems ugly and would waste a lot of memories if the buffers
are used in a network driver for the TX/RX. We have added the align
support for the page_frag functions, so add the corresponding
{netdev,napi}_frag functions.
Signed-off-by: NKevin Hao <haokexin@gmail.com>
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

3f6e687d

net: sched: Return the correct errno code · a64566a2

由 Zheng Yongjun 提交于 2月 04, 2021

When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.
Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20210204073950.18372-1-zhengyongjun3@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a64566a2

dccp: Return the correct errno code · 247b557e

由 Zheng Yongjun 提交于 2月 04, 2021

When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.
Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20210204072820.17723-1-zhengyongjun3@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

247b557e

net: bridge: mcast: Use ERR_CAST instead of ERR_PTR(PTR_ERR()) · 1697291d

由 Xu Wang 提交于 2月 04, 2021

Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)).

net/bridge/br_multicast.c:1246:9-16: WARNING: ERR_CAST can be used with mp
Generated by: scripts/coccinelle/api/err_cast.cocci
Signed-off-by: NXu Wang <vulab@iscas.ac.cn>
Link: https://lore.kernel.org/r/20210204070549.83636-1-vulab@iscas.ac.cnSigned-off-by: NJakub Kicinski <kuba@kernel.org>

1697291d

06 2月, 2021 4 次提交

batman-adv: Fix names for kernel-doc blocks · 25d81f93

由 Sven Eckelmann 提交于 1月 20, 2021

kernel-doc can only correctly identify the documented function or struct
when the name in the first kernel-doc line references it. But some of the
kernel-doc blocks referenced a different function/struct then it actually
documented.
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Signed-off-by: NSimon Wunderlich <sw@simonwunderlich.de>

25d81f93

batman-adv: Avoid sizeof on flexible structure · 576fb671

由 Sven Eckelmann 提交于 12月 29, 2020

The batadv_dhcp_packet is used to read in parts of the DHCP packet and
extract relevant information for the distributed arp table. But the
structure contained the flexible member "options" which is no where used in
the code.

A sizeof on this kind of type would return the size of everything except
the flexible member. But sparse will detect this kind of sizeof and warn
with

  warning: using sizeof on a flexible structure

This can be avoided by dropping the unused flexible member.
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Signed-off-by: NSimon Wunderlich <sw@simonwunderlich.de>

576fb671

batman-adv: Drop publication years from copyright info · cfa55c6d

由 Sven Eckelmann 提交于 1月 01, 2021

The batman-adv source code was using the year of publication (to net-next)
as "last" year for the copyright statement. The whole source code mentioned
in the MAINTAINERS "BATMAN ADVANCED" section was handled as a single entity
regarding the publishing year.

This avoided having outdated (in sense of year information - not copyright
holder) publishing information inside several files. But since the simple
"update copyright year" commit (without other changes) in the file was not
well received in the upstream kernel, the option to not have a copyright
year (for initial and last publication) in the files are chosen instead.
More detailed information about the years can still be retrieved from the
SCM system.
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Acked-by: NMarek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: NSimon Wunderlich <sw@simonwunderlich.de>

cfa55c6d

net: gro: do not keep too many GRO packets in napi->rx_list · 8dc1c444

由 Eric Dumazet 提交于 2月 04, 2021

Commit c8079432 ("net: Fix packet reordering caused by GRO and
listified RX cooperation") had the unfortunate effect of adding
latencies in common workloads.

Before the patch, GRO packets were immediately passed to
upper stacks.

After the patch, we can accumulate quite a lot of GRO
packets (depdending on NAPI budget).

My fix is counting in napi->rx_count number of segments
instead of number of logical packets.

Fixes: c8079432 ("net: Fix packet reordering caused by GRO and listified RX cooperation")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Bisected-by: NJohn Sperbeck <jsperbeck@google.com>
Tested-by: NJian Yang <jianyang@google.com>
Cc: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: NSaeed Mahameed <saeedm@nvidia.com>
Reviewed-by: NEdward Cree <ecree.xilinx@gmail.com>
Reviewed-by: NAlexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

8dc1c444

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功