提交 · f5aef4241592765a868611509e5170bbe8c89ae7 · openeuler / Kernel

23 9月, 2021 2 次提交

net: dsa: sja1105: break dependency between dsa_port_is_sja1105 and switch driver · f5aef424

由 Vladimir Oltean 提交于 9月 22, 2021

It's nice to be able to test a tagging protocol with dsa_loop, but not
at the cost of losing the ability of building the tagging protocol and
switch driver as modules, because as things stand, there is a circular
dependency between the two. Tagging protocol drivers cannot depend on
switch drivers, that is a hard fact.

The reasoning behind the blamed patch was that accessing dp->priv should
first make sure that the structure behind that pointer is what we really
think it is.

Currently the "sja1105" and "sja1110" tagging protocols only operate
with the sja1105 switch driver, just like any other tagging protocol and
switch combination. The only way to mix and match them is by modifying
the code, and this applies to dsa_loop as well (by default that uses
DSA_TAG_PROTO_NONE). So while in principle there is an issue, in
practice there isn't one.

Until we extend dsa_loop to allow user space configuration, treat the
problem as a non-issue and just say that DSA ports found by tag_sja1105
are always sja1105 ports, which is in fact true. But keep the
dsa_port_is_sja1105 function so that it's easy to patch it during
testing, and rely on dead code elimination.

Fixes: 994d2cbb ("net: dsa: tag_sja1105: be dsa_loop-safe")
Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5aef424

net: dsa: move sja1110_process_meta_tstamp inside the tagging protocol driver · 6d709cad

由 Vladimir Oltean 提交于 9月 22, 2021

The problem is that DSA tagging protocols really must not depend on the
switch driver, because this creates a circular dependency at insmod
time, and the switch driver will effectively not load when the tagging
protocol driver is missing.

The code was structured in the way it was for a reason, though. The DSA
driver-facing API for PTP timestamping relies on the assumption that
two-step TX timestamps are provided by the hardware in an out-of-band
manner, typically by raising an interrupt and making that timestamp
available inside some sort of FIFO which is to be accessed over
SPI/MDIO/etc.

So the API puts .port_txtstamp into dsa_switch_ops, because it is
expected that the switch driver needs to save some state (like put the
skb into a queue until its TX timestamp arrives).

On SJA1110, TX timestamps are provided by the switch as Ethernet
packets, so this makes them be received and processed by the tagging
protocol driver. This in itself is great, because the timestamps are
full 64-bit and do not require reconstruction, and since Ethernet is the
fastest I/O method available to/from the switch, PTP timestamps arrive
very quickly, no matter how bottlenecked the SPI connection is, because
SPI interaction is not needed at all.

DSA's code structure and strict isolation between the tagging protocol
driver and the switch driver break the natural code organization.

When the tagging protocol driver receives a packet which is classified
as a metadata packet containing timestamps, it passes those timestamps
one by one to the switch driver, which then proceeds to compare them
based on the recorded timestamp ID that was generated in .port_txtstamp.

The communication between the tagging protocol and the switch driver is
done through a method exported by the switch driver, sja1110_process_meta_tstamp.
To satisfy build requirements, we force a dependency to build the
tagging protocol driver as a module when the switch driver is a module.
However, as explained in the first paragraph, that causes the circular
dependency.

To solve this, move the skb queue from struct sja1105_private :: struct
sja1105_ptp_data to struct sja1105_private :: struct sja1105_tagger_data.
The latter is a data structure for which hacks have already been put
into place to be able to create persistent storage per switch that is
accessible from the tagging protocol driver (see sja1105_setup_ports).

With the skb queue directly accessible from the tagging protocol driver,
we can now move sja1110_process_meta_tstamp into the tagging driver
itself, and avoid exporting a symbol.

Fixes: 566b18c8 ("net: dsa: sja1105: implement TX timestamping for SJA1110")
Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d709cad

22 9月, 2021 2 次提交

skbuff: pass the result of data ksize to __build_skb_around · a5df6333

由 Li RongQing 提交于 9月 22, 2021

Avoid to call ksize again in __build_skb_around by passing
the result of data ksize to __build_skb_around

nginx stress test shows this change can reduce ksize cpu usage,
and give a little performance boost
Signed-off-by: NLi RongQing <lirongqing@baidu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a5df6333

devlink: Make devlink_register to be void · db4278c5

由 Leon Romanovsky 提交于 9月 22, 2021

devlink_register() can't fail and always returns success, but all drivers
are obligated to check returned status anyway. This adds a lot of boilerplate
code to handle impossible flow.

Make devlink_register() void and simplify the drivers that use that
API call.
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Acked-by: NSimon Horman <simon.horman@corigine.com>
Acked-by: Vladimir Oltean <olteanv@gmail.com> # dsa
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db4278c5

21 9月, 2021 3 次提交

net/ipv4/sysctl_net_ipv4.c: remove superfluous header files from sysctl_net_ipv4.c · 07b85562

由 Mianhan Liu 提交于 9月 21, 2021

sysctl_net_ipv4.c hasn't use any macro or function declared in igmp.h,
inetdevice.h, mm.h, module.h, nsproxy.h, swap.h, inet_frag.h, route.h
and snmp.h. Thus, these files can be removed from sysctl_net_ipv4.c
safely without affecting the compilation of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07b85562

net/ipv4/syncookies.c: remove superfluous header files from syncookies.c · c595b120

由 Mianhan Liu 提交于 9月 20, 2021

syncookies.c hasn't use any macro or function declared in slab.h and random.h,
Thus, these files can be removed from syncookies.c safely without
affecting the compilation of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c595b120

net/ipv4/udp_tunnel_core.c: remove superfluous header files from udp_tunnel_core.c · bea71458

由 Mianhan Liu 提交于 9月 20, 2021

udp_tunnel_core.c hasn't use any macro or function declared in udp.h, types.h,
and net_namespace.h. Thus, these files can be removed from udp_tunnel_core.c
safely without affecting the compilation of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bea71458

20 9月, 2021 3 次提交

net/ipv4/tcp_minisocks.c: remove superfluous header files from tcp_minisocks.c · 85c69886

由 Mianhan Liu 提交于 9月 20, 2021

tcp_minisocks.c hasn't use any macro or function declared in mm.h, module.h,
slab.h, sysctl.h, workqueue.h, static_key.h and inet_common.h. Thus, these
files can be removed from tcp_minisocks.c safely without affecting the
compilation of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

85c69886

net/ipv4/tcp_fastopen.c: remove superfluous header files from tcp_fastopen.c · 222a3140

由 Mianhan Liu 提交于 9月 20, 2021

tcp_fastopen.c hasn't use any macro or function declared in crypto.h, err.h,
init.h, list.h, rculist.h and inetpeer.h. Thus, these files can be removed
from tcp_fastopen.c safely without affecting the compilation of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

222a3140

net/ipv4/route.c: remove superfluous header files from route.c · ffa66f15

由 Mianhan Liu 提交于 9月 20, 2021

route.c hasn't use any macro or function declared in uaccess.h, types.h,
string.h, sockios.h, times.h, protocol.h, arp.h and l3mdev.h. Thus, these
files can be removed from route.c safely without affecting the compilation
of the net module.
Signed-off-by: NMianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ffa66f15

19 9月, 2021 4 次提交

net: sched: move and reuse mq_change_real_num_tx() · f7116fb4

由 Jakub Kicinski 提交于 9月 17, 2021

The code for handling active queue changes is identical
between mq and mqprio, reuse it.
Suggested-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f7116fb4

net: rtnetlink: convert rcu_assign_pointer to RCU_INIT_POINTER · 4fc29989

由 Yajun Deng 提交于 9月 18, 2021

It no need barrier when assigning a NULL value to an RCU protected
pointer. So use RCU_INIT_POINTER() instead for more fast.
Signed-off-by: NYajun Deng <yajun.deng@linux.dev>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4fc29989

NET: IPV4: fix error "do not initialise globals to 0" · db9c8e2b

由 wangzhitong 提交于 9月 18, 2021

this patch fixes below Errors reported by checkpatch
    ERROR: do not initialise globals to 0
    +int cipso_v4_rbm_optfmt = 0;
Signed-off-by: Nwangzhitong <wangzhitong@uniontech.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db9c8e2b

net: net_namespace: Fix undefined member in key_remove_domain() · aed0826b

由 Yajun Deng 提交于 9月 18, 2021

The key_domain member in struct net only exists if we define CONFIG_KEYS.
So we should add the define when we used key_domain.

Fixes: 9b242610 ("keys: Network namespace domain tag")
Signed-off-by: NYajun Deng <yajun.deng@linux.dev>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aed0826b

18 9月, 2021 4 次提交

mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt support · c11c5906

由 Florian Westphal 提交于 9月 17, 2021

This retrieves the address pairs of all subflows currently
active for a given mptcp connection.

It re-uses the same meta-header as for MPTCP_TCPINFO.

A new structure is provided to hold the subflow
address data:

struct mptcp_subflow_addrs {
	union {
		__kernel_sa_family_t sa_family;
		struct sockaddr sa_local;
		struct sockaddr_in sin_local;
		struct sockaddr_in6 sin6_local;
		struct sockaddr_storage ss_local;
	};
	union {
		struct sockaddr sa_remote;
		struct sockaddr_in sin_remote;
		struct sockaddr_in6 sin6_remote;
		struct sockaddr_storage ss_remote;
	};
};

Usage of the new getsockopt is very similar to
MPTCP_TCPINFO one.

Userspace allocates a
'struct mptcp_subflow_data', followed by one or
more 'struct mptcp_subflow_addrs', then inits the
mptcp_subflow_data structure as follows:

struct mptcp_subflow_addrs *sf_addr;
struct mptcp_subflow_data *addr;
socklen_t olen = sizeof(*addr) + (8 * sizeof(*sf_addr));

addr = malloc(olen);
addr->size_subflow_data = sizeof(*addr);
addr->num_subflows = 0;
addr->size_kernel = 0;
addr->size_user = sizeof(struct mptcp_subflow_addrs);

sf_addr = (struct mptcp_subflow_addrs *)(addr + 1);

and then retrieves the endpoint addresses via:
ret = getsockopt(fd, SOL_MPTCP, MPTCP_SUBFLOW_ADDRS,
		 addr, &olen);

If the call succeeds, kernel will have added up to 8
endpoint addresses after the 'mptcp_subflow_data' header.

Userspace needs to re-check 'olen' value to detect how
many bytes have been filled in by the kernel.

Userspace can check addr->num_subflows to discover when
there were more subflows that available data space.
Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c11c5906

mptcp: add MPTCP_TCPINFO getsockopt support · 06f15cee

由 Florian Westphal 提交于 9月 17, 2021

Allow users to retrieve TCP_INFO data of all subflows.

Users need to pre-initialize a meta header that has to be
prepended to the data buffer that will be filled with the tcp info data.

The meta header looks like this:

struct mptcp_subflow_data {
 __u32 size_subflow_data;/* size of this structure in userspace */
 __u32 num_subflows;	/* must be 0, set by kernel */
 __u32 size_kernel;	/* must be 0, set by kernel */
 __u32 size_user;	/* size of one element in data[] */
} __attribute__((aligned(8)));

size_subflow_data has to be set to 'sizeof(struct mptcp_subflow_data)'.
This allows to extend mptcp_subflow_data structure later on without
breaking backwards compatibility.

If the structure is extended later on, kernel knows where the
userspace-provided meta header ends, even if userspace uses an older
(smaller) version of the structure.

num_subflows must be set to 0. If the getsockopt request succeeds (return
value is 0), it will be updated to contain the number of active subflows
for the given logical connection.

size_kernel must be set to 0. If the getsockopt request is successful,
it will contain the size of the 'struct tcp_info' as known by the kernel.
This is informational only.

size_user must be set to 'sizeof(struct tcp_info)'.

This allows the kernel to only fill in the space reserved/expected by
userspace.

Example:

struct my_tcp_info {
  struct mptcp_subflow_data d;
  struct tcp_info ti[2];
};
struct my_tcp_info ti;
socklen_t olen;

memset(&ti, 0, sizeof(ti));

ti.d.size_subflow_data = sizeof(struct mptcp_subflow_data);
ti.d.size_user = sizeof(struct tcp_info);
olen = sizeof(ti);

ret = getsockopt(fd, SOL_MPTCP, MPTCP_TCPINFO, &ti, &olen);
if (ret < 0)
	die_perror("getsockopt MPTCP_TCPINFO");

mptcp_subflow_data.num_subflows is populated with the number of
subflows that exist on the kernel side for the logical mptcp connection.

This allows userspace to re-try with a larger tcp_info array if the number
of subflows was larger than the available space in the ti[] array.

olen has to be set to the number of bytes that userspace has allocated to
receive the kernel data.  It will be updated to contain the real number
bytes that have been copied to by the kernel.

In the above example, if the number if subflows was 1, olen is equal to
'sizeof(struct mptcp_subflow_data) + sizeof(struct tcp_info).
For 2 or more subflows olen is equal to 'sizeof(struct my_tcp_info)'.

If there was more data that could not be copied due to lack of space
in the option buffer, userspace can detect this by checking
mptcp_subflow_data->num_subflows.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06f15cee

mptcp: add MPTCP_INFO getsockopt · 55c42fa7

由 Florian Westphal 提交于 9月 17, 2021

Its not compatible with multipath-tcp.org kernel one.

1. The out-of-tree implementation defines a different 'struct mptcp_info',
   with embedded __user addresses for additional data such as
   endpoint addresses.

2. Mat Martineau points out that embedded __user addresses doesn't work
with BPF_CGROUP_RUN_PROG_GETSOCKOPT() which assumes that copying in
optsize bytes from optval provides all data that got copied to userspace.

This provides mptcp_info data for the given mptcp socket.

Userspace sets optlen to the size of the structure it expects.
The kernel updates it to contain the number of bytes that it copied.

This allows to append more information to the structure later.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55c42fa7

mptcp: add new mptcp_fill_diag helper · 61bc6e82

由 Florian Westphal 提交于 9月 17, 2021

Will be re-used from getsockopt path.
Since diag can be a module, we can't export the helper from diag, it
needs to be moved to core.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61bc6e82

17 9月, 2021 1 次提交

devlink: Delete not-used devlink APIs · 6db9350a

由 Leon Romanovsky 提交于 9月 16, 2021

Devlink core exported generously the functions calls that were used
by netdevsim tests or not used at all.

Delete such APIs with one exception - devlink_alloc_ns(). That function
should be spared from deleting because it is a special form of devlink_alloc()
needed for the netdevsim.
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6db9350a

16 9月, 2021 3 次提交

net/tls: support SM4 GCM/CCM algorithm · 227b9644

由 Tianjia Zhang 提交于 9月 16, 2021

The RFC8998 specification defines the use of the ShangMi algorithm
cipher suites in TLS 1.3, and also supports the GCM/CCM mode using
the SM4 algorithm.
Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

227b9644

net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports · a57d8c21

由 Vladimir Oltean 提交于 9月 14, 2021

Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:

mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2

(and similarly for other ports)

What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.

But deleting that cleanup code, the warnings go away.

=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.

The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.

The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.

In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.

So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.

Fixes: 161ca59d ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a57d8c21

net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup · 6a52e733

由 Vladimir Oltean 提交于 9月 14, 2021

DSA supports connecting to a phy-handle, and has a fallback to a non-OF
based method of connecting to an internal PHY on the switch's own MDIO
bus, if no phy-handle and no fixed-link nodes were present.

The -ENODEV error code from the first attempt (phylink_of_phy_connect)
is what triggers the second attempt (phylink_connect_phy).

However, when the first attempt returns a different error code than
-ENODEV, this results in an unbalance of calls to phylink_create and
phylink_destroy by the time we exit the function. The phylink instance
has leaked.

There are many other error codes that can be returned by
phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
So this is a practical issue too.

Fixes: aab9c406 ("net: dsa: Plug in PHYLINK support")
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Reviewed-by: NRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6a52e733

15 9月, 2021 3 次提交

devlink: Delete not-used single parameter notification APIs · c2d2f988

由 Leon Romanovsky 提交于 9月 14, 2021

There is no need in specific devlink_param_*publish(), because same
output can be achieved by using devlink_params_*publish() in correct
places.
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Acked-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c2d2f988

net: sched: update default qdisc visibility after Tx queue cnt changes · 1e080f17

由 Jakub Kicinski 提交于 9月 13, 2021

mq / mqprio make the default child qdiscs visible. They only do
so for the qdiscs which are within real_num_tx_queues when the
device is registered. Depending on order of calls in the driver,
or if user space changes config via ethtool -L the number of
qdiscs visible under tc qdisc show will differ from the number
of queues. This is confusing to users and potentially to system
configuration scripts which try to make sure qdiscs have the
right parameters.

Add a new Qdisc_ops callback and make relevant qdiscs TTRT.

Note that this uncovers the "shortcut" created by
commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues")
The default child qdiscs beyond initial real_num_tx are always
pfifo_fast, no matter what the sysfs setting is. Fixing this
gets a little tricky because we'd need to keep a reference
on whatever the default qdisc was at the time of creation.
In practice this is likely an non-issue the qdiscs likely have
to be configured to non-default settings, so whatever user space
is doing such configuration can replace the pfifos... now that
it will see them.
Reported-by: NMatthew Massey <matthewmassey@fb.com>
Reviewed-by: NDave Taht <dave.taht@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e080f17

net: dsa: tag_rtl4_a: Drop bit 9 from egress frames · 339133f6

由 Linus Walleij 提交于 9月 13, 2021

This drops the code setting bit 9 on egress frames on the
Realtek "type A" (RTL8366RB) frames.

This bit was set on ingress frames for unknown reason,
and was set on egress frames as the format of ingress
and egress frames was believed to be the same. As that
assumption turned out to be false, and since this bit
seems to have zero effect on the behaviour of the switch
let's drop this bit entirely.
Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210913143156.1264570-1-linus.walleij@linaro.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

339133f6

14 9月, 2021 9 次提交

skbuff: inline page_frag_alloc_align() · 32e3573f

由 Yajun Deng 提交于 9月 14, 2021

The __alloc_frag_align() is short, and only called by two functions,
so inline page_frag_alloc_align() for reduce the overhead of calls.
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Signed-off-by: NYajun Deng <yajun.deng@linux.dev>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

32e3573f

ethtool: prevent endless loop if eeprom size is smaller than announced · b9bbc4c1

由 Heiner Kallweit 提交于 9月 13, 2021

It shouldn't happen, but can happen that readable eeprom size is smaller
than announced. Then we would be stuck in an endless loop here because
after reaching the actual end reads return eeprom.len = 0. I faced this
issue when making a mistake in driver development. Detect this scenario
and return an error.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b9bbc4c1

Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"" · d198b277

由 Eric Dumazet 提交于 9月 13, 2021

This reverts commit d7807a9a.

As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 91948309 ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.

  ip_cmsg_send() can loop over multiple cmsghdr()

  If IP_RETOPTS has been successful, but following cmsghdr generates an error,
  we do not free ipc.ok

  If IP_RETOPTS is not successful, we have freed the allocated temporary space,
  not the one currently in ipc.opt.

Sure, code could be refactored, but let's not bring back old bugs.

Fixes: d7807a9a ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d198b277

tcp: fix tp->undo_retrans accounting in tcp_sacktag_one() · 4f884f39

由 zhenggy 提交于 9月 14, 2021

Commit 10d3be56 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retrans a multiple segments TSO/GSO packet without
split, Since this commit, we can no longer assume that a retransmitted
packet is a single segment.

This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
that use the actual segments(pcount) of the retransmitted packet.

Before that commit (10d3be56), the assumption underlying the
tp->undo_retrans-- seems correct.

Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Nzhenggy <zhenggy@chinatelecom.cn>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f884f39

net-caif: avoid user-triggerable WARN_ON(1) · 550ac9c1

由 Eric Dumazet 提交于 9月 13, 2021

syszbot triggers this warning, which looks something
we can easily prevent.

If we initialize priv->list_field in chnl_net_init(),
then always use list_del_init(), we can remove robust_list_del()
completely.

WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 robust_list_del net/caif/chnl_net.c:67 [inline]
WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Modules linked in:
CPU: 0 PID: 3233 Comm: syz-executor.3 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:robust_list_del net/caif/chnl_net.c:67 [inline]
RIP: 0010:chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Code: 89 eb e8 3a a3 ba f8 48 89 d8 48 c1 e8 03 42 80 3c 28 00 0f 85 bf 01 00 00 48 81 fb 00 14 4e 8d 48 8b 2b 75 d0 e8 17 a3 ba f8 <0f> 0b 5b 5d 41 5c 41 5d e9 0a a3 ba f8 4c 89 e3 e8 02 a3 ba f8 4c
RSP: 0018:ffffc90009067248 EFLAGS: 00010202
RAX: 0000000000008780 RBX: ffffffff8d4e1400 RCX: ffffc9000fd34000
RDX: 0000000000040000 RSI: ffffffff88bb6e49 RDI: 0000000000000003
RBP: ffff88802cd9ee08 R08: 0000000000000000 R09: ffffffff8d0e6647
R10: ffffffff88bb6dc2 R11: 0000000000000000 R12: ffff88803791ae08
R13: dffffc0000000000 R14: 00000000e600ffce R15: ffff888073ed3480
FS:  00007fed10fa0700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2c322000 CR3: 00000000164a6000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 register_netdevice+0xadf/0x1500 net/core/dev.c:10347
 ipcaif_newlink+0x4c/0x260 net/caif/chnl_net.c:468
 __rtnl_newlink+0x106d/0x1750 net/core/rtnetlink.c:3458
 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3506
 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
 netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:724
 __sys_sendto+0x21c/0x320 net/socket.c:2036
 __do_sys_sendto net/socket.c:2048 [inline]
 __se_sys_sendto net/socket.c:2044 [inline]
 __x64_sys_sendto+0xdd/0x1b0 net/socket.c:2044
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: cc36a070 ("net-caif: add CAIF netdevice")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

550ac9c1

net/smc: add generic netlink support for system EID · 3c572145

由 Karsten Graul 提交于 9月 14, 2021

With SMC-Dv2 users can configure if the static system EID should be used
during CLC handshake, or if only user EIDs are allowed.
Add generic netlink support to enable and disable the system EID, and
to retrieve the system EID and its current enabled state.
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Reviewed-by: NGuvenc Gulce  <guvenc@linux.ibm.com>
Signed-off-by: NGuvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c572145

net/smc: keep static copy of system EID · 11a26c59

由 Karsten Graul 提交于 9月 14, 2021

The system EID is retrieved using an registered ISM device each time
when needed. This adds some unnecessary complexity at all places where
the system EID is needed, but no ISM device is at hand.
Simplify the code and save the system EID in a static variable in
smc_ism.c.
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Reviewed-by: NGuvenc Gulce  <guvenc@linux.ibm.com>
Signed-off-by: NGuvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11a26c59

net/smc: add support for user defined EIDs · fa086662

由 Karsten Graul 提交于 9月 14, 2021

SMC-Dv2 allows users to define EIDs which allows to create separate
name spaces enabling users to cluster their SMC-Dv2 connections.
Add support for user defined EIDs and extent the generic netlink
interface so users can add, remove and dump EIDs.
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Reviewed-by: NGuvenc Gulce  <guvenc@linux.ibm.com>
Signed-off-by: NGuvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa086662

bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224

由 Daniel Borkmann 提交于 9月 14, 2021

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.

[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/

Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NStanislav Fomichev <sdf@google.com>
Acked-by: NTejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

8520e224

13 9月, 2021 5 次提交

nfc: do not break pr_debug() call into separate lines · 3537e507

由 Krzysztof Kozlowski 提交于 9月 13, 2021

Remove unneeded line break between pr_debug and arguments.
Signed-off-by: NKrzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3537e507

ipv6: delay fib6_sernum increase in fib6_add · e87b5052

由 zhang kai 提交于 9月 09, 2021

only increase fib6_sernum in net namespace after add fib6_info
successfully.
Signed-off-by: Nzhang kai <zhangkaiheb@126.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e87b5052

tipc: increase timeout in tipc_sk_enqueue() · f4bb62e6

由 Hoang Le 提交于 9月 13, 2021

In tipc_sk_enqueue() we use hardcoded 2 jiffies to extract
socket buffer from generic queue to particular socket.
The 2 jiffies is too short in case there are other high priority
tasks get CPU cycles for multiple jiffies update. As result, no
buffer could be enqueued to particular socket.

To solve this, we switch to use constant timeout 20msecs.
Then, the function will be expired between 2 jiffies (CONFIG_100HZ)
and 20 jiffies (CONFIG_1000HZ).

Fixes: c637c103 ("tipc: resolve race problem at unicast message reception")
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4bb62e6

udp_tunnel: Fix udp_tunnel_nic work-queue type · e50e7113

由 Aya Levin 提交于 9月 13, 2021

Turn udp_tunnel_nic work-queue to an ordered work-queue. This queue
holds the UDP-tunnel configuration commands of the different netdevs.
When the netdevs are functions of the same NIC the order of
execution may be crucial.

Problem example:
NIC with 2 PFs, both PFs declare offload quota of up to 3 UDP-ports.
 $ifconfig eth2 1.1.1.1/16 up

 $ip link add eth2_19503 type vxlan id 5049 remote 1.1.1.2 dev eth2 dstport 19053
 $ip link set dev eth2_19503 up

 $ip link add eth2_19504 type vxlan id 5049 remote 1.1.1.3 dev eth2 dstport 19054
 $ip link set dev eth2_19504 up

 $ip link add eth2_19505 type vxlan id 5049 remote 1.1.1.4 dev eth2 dstport 19055
 $ip link set dev eth2_19505 up

 $ip link add eth2_19506 type vxlan id 5049 remote 1.1.1.5 dev eth2 dstport 19056
 $ip link set dev eth2_19506 up

NIC RX port offload infrastructure offloads the first 3 UDP-ports (on
all devices which sets NETIF_F_RX_UDP_TUNNEL_PORT feature) and not
UDP-port 19056. So both PFs gets this offload configuration.

 $ip link set dev eth2_19504 down

This triggers udp-tunnel-core to remove the UDP-port 19504 from
offload-ports-list and offload UDP-port 19056 instead.

In this scenario it is important that the UDP-port of 19504 will be
removed from both PFs before trying to add UDP-port 19056. The NIC can
stop offloading a UDP-port only when all references are removed.
Otherwise the NIC may report exceeding of the offload quota.

Fixes: cc4e3835 ("udp_tunnel: add central NIC RX port offload infrastructure")
Signed-off-by: NAya Levin <ayal@nvidia.com>
Reviewed-by: NTariq Toukan <tariqt@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e50e7113

Revert "ipv4: fix memory leaks in ip_cmsg_send() callers" · d7807a9a

由 Yajun Deng 提交于 9月 13, 2021

This reverts commit 91948309.

There is only when ip_options_get() return zero need to free.
It already called kfree() when return error.

Fixes: 91948309 ("ipv4: fix memory leaks in ip_cmsg_send() callers")
Signed-off-by: NYajun Deng <yajun.deng@linux.dev>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d7807a9a

11 9月, 2021 1 次提交

selftests/bpf: Test new __sk_buff field hwtstamp · 3384c7c7

由 Vadim Fedorenko 提交于 9月 10, 2021

Analogous to the gso_segs selftests introduced in commit d9ff286a
("bpf: allow BPF programs access skb_shared_info->gso_segs field").
Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210909220409.8804-3-vfedorenko@novek.ru

3384c7c7

openeuler / Kernel 接近 3 年 前同步成功

openeuler / Kernel
接近 3 年前同步成功