Commits · 7439d687b79cbbd971c6a170be9aefda4a564be4 · openeuler / Kernel

01 Dec, 2020 · 5 commits

mptcp: avoid a few atomic ops in the rx path · 7439d687

Committed by Paolo Abeni on Nov 27, 2020

By extending the data_lock scope in mptcp_incoming_option(),
we can use it to protect both snd_una and wnd_end.
In the typical case, we will have a single atomic op instead of 2.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: allocate TX skbs in msk context · 724cfd2e

Committed by Paolo Abeni on Nov 27, 2020

Move the TX skb allocation into the mptcp_sendmsg() scope,
and tentatively pre-allocate a number of skbs proportional
to the sendmsg() length.

Use the ssk TX skb cache to avoid allocating in the subflow.

This allows removing the msk skb extension cache and makes
the later patches possible.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: protect the rx path with the msk socket spinlock · 87952603

Committed by Paolo Abeni on Nov 27, 2020

This spinlock is currently used only to protect the 'owned'
flag inside the socket lock itself. With this patch, we extend
its scope to protect the whole msk receive path and
sk_forward_alloc.

Given the above, we can always move data into the msk receive
queue (and OoO queue) from the subflow.

We leverage the previous commit, so that we need to acquire the
spinlock in the TX path only when moving forward memory.

recvmsg() must now explicitly acquire the socket spinlock
when moving skbs out of sk_receive_queue. To reduce the number of
lock operations required, we use a second RX queue and splice the
first into the latter in mptcp_lock_sock(). Additionally, rmem
allocated memory is bulk-freed via release_cb().

Acked-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: implement wmem reservation · e93da928

Committed by Paolo Abeni on Nov 27, 2020

This leverages the previous commit to reserve the wmem
required for the sendmsg() operation when the msk socket
lock is first acquired.
Some heuristics are used to get a reasonable [over]estimation of
the whole memory required. If we can't forward-allocate that amount,
we fall back to a reasonably small chunk; if even that fails, we
enter the wait-for-memory path.

When sendmsg() needs more memory, it looks at wmem_reserved
first and, if that is exhausted, moves more space from
sk_forward_alloc.

The reserved memory is not persistent and is released at the
next socket unlock via the release_cb().
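The reservation flow described above can be modelled with a small userspace sketch. Everything in it is hypothetical: struct msk_mem_sketch, the budget field standing in for what memory accounting would still grant, the WMEM_FALLBACK_CHUNK size, and the assumption that release_cb() hands unused reservation back to sk_forward_alloc are illustrative, not the MPTCP implementation.

```c
#include <assert.h>

/* Hypothetical "reasonably small chunk" used when the full estimate
 * cannot be forward-allocated. */
#define WMEM_FALLBACK_CHUNK 4096

struct msk_mem_sketch {
	int budget;         /* memory the accounting would still grant */
	int fwd_alloc;      /* stands in for sk_forward_alloc */
	int wmem_reserved;  /* reserved for the current sendmsg() */
};

/* Forward-allocate 'size' bytes if the budget allows it. */
static int try_fwd_alloc(struct msk_mem_sketch *m, int size)
{
	if (size > m->budget)
		return 0;
	m->budget -= size;
	m->fwd_alloc += size;
	return 1;
}

/* On first socket lock in sendmsg(): reserve the estimate, else a small
 * chunk; 0 means the caller must enter the wait-for-memory path. */
static int wmem_reserve(struct msk_mem_sketch *m, int estimate)
{
	int size = estimate;

	if (!try_fwd_alloc(m, size)) {
		size = WMEM_FALLBACK_CHUNK;
		if (!try_fwd_alloc(m, size))
			return 0;
	}
	m->fwd_alloc -= size;
	m->wmem_reserved += size;
	return 1;
}

/* Charge 'size' bytes: wmem_reserved first, then sk_forward_alloc. */
static int wmem_charge(struct msk_mem_sketch *m, int size)
{
	int from_resv = size < m->wmem_reserved ? size : m->wmem_reserved;
	int rest = size - from_resv;

	/* Check availability before touching any counter. */
	if (rest > m->fwd_alloc && !try_fwd_alloc(m, rest - m->fwd_alloc))
		return 0;
	m->wmem_reserved -= from_resv;
	m->fwd_alloc -= rest;
	return 1;
}

/* release_cb() at unlock: the reservation is not persistent
 * (assumed here to flow back into sk_forward_alloc). */
static void wmem_release(struct msk_mem_sketch *m)
{
	m->fwd_alloc += m->wmem_reserved;
	m->wmem_reserved = 0;
}
```

For example, with a budget of 100000 bytes a 65536-byte estimate is reserved in full; with a budget of 8000 only the 4096-byte fallback chunk fits, and with 1000 the reservation fails and the caller would wait for memory.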

Overall this will simplify the next patch.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: open code mptcp variant for lock_sock · ad80b0fc

Committed by Paolo Abeni on Nov 27, 2020

This allows invoking an additional callback under the
socket spin lock.

This will be used by the next patches to avoid additional
spin lock contention.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

29 Nov, 2020 · 1 commit

net/sched: act_ct: enable stats for HW offloaded entries · 3567e233

Committed by Marcelo Ricardo Leitner on Nov 26, 2020

Enable the stats by setting NF_FLOWTABLE_COUNTER. Otherwise, the updates
added by commit ef803b3c ("netfilter: flowtable: add counter support in
HW offload") are not effective when using act_ct.

While at it, now that we have the flag set, protect the call to
nf_ct_acct_update() added by commit beb97d3a ("net/sched: act_ct: update
nf_conn_acct for act_ct SW offload in flowtable") with the check on
NF_FLOWTABLE_COUNTER, as is already done in other places.

Note that this shouldn't impact performance, as these stats are only
enabled when net.netfilter.nf_conntrack_acct is enabled.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/481a65741261fd81b0a0813e698af163477467ec.1606415787.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

28 Nov, 2020 · 17 commits

tipc: update address terminology in code · b6f88d9c

Committed by Jon Maloy on Nov 25, 2020

We update the terminology in the code so that deprecated structure
names and macros are replaced with those currently recommended in
the user API.

struct tipc_portid   -> struct tipc_socket_addr
struct tipc_name     -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range

TIPC_ADDR_ID       -> TIPC_SOCKET_ADDR
TIPC_ADDR_NAME     -> TIPC_SERVICE_ADDR
TIPC_ADDR_NAMESEQ  -> TIPC_SERVICE_RANGE
TIPC_CFG_SRV       -> TIPC_NODE_STATE

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: make node number calculation reproducible · 5f75e0a0

Committed by Jon Maloy on Nov 25, 2020

The 32-bit node number, aka node hash or node address, is calculated
based on the 128-bit node identity when it is not set explicitly by
the user. In future commits we will need to perform this hash operation
on peer nodes while being sure that we obtain the same result.

We do this by interpreting the initial hash as a network byte order
number. Whenever we need to use the number locally on a node,
we must therefore translate it to host byte order to obtain an
architecture-independent result.

Furthermore, given the context where we use this number, we must not
allow it to be zero unless the node identity also is zero. Hence, in
the rare cases when the xor-ed hash value may end up as zero, we replace
it with a fixed number, knowing that the code is capable of
handling hash collisions anyway.
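A minimal userspace sketch of the idea, assuming the 128-bit identity is folded by xor-ing its four 32-bit words. The function name and the NODE_HASH_FALLBACK constant are hypothetical and not taken from net/tipc; the point is that ntohl() is applied to the folded value, so every architecture computes the same number for the same identity.

```c
#include <arpa/inet.h>  /* ntohl() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement used when the fold yields zero for a
 * non-zero identity; the kernel's actual constant may differ. */
#define NODE_HASH_FALLBACK 0x01234567u

/* Fold a 128-bit node identity into a 32-bit node number. */
static uint32_t node_hash128to32(const uint8_t id[16])
{
	uint32_t w[4];
	uint32_t folded;

	memcpy(w, id, sizeof(w));  /* avoid unaligned 32-bit loads */

	/* The words are treated as network byte order values, so the
	 * result is converted to host order once, after the xor. */
	folded = ntohl(w[0] ^ w[1] ^ w[2] ^ w[3]);
	if (folded)
		return folded;

	/* Zero is only allowed for an all-zero identity. */
	if ((w[0] | w[1] | w[2] | w[3]) == 0)
		return 0;
	return NODE_HASH_FALLBACK;
}
```

For instance, an identity starting with the bytes 12 34 56 78 and otherwise zero yields 0x12345678 on both little- and big-endian hosts, while an identity whose first two words are equal folds to zero and therefore gets the fallback value.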

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: refactor tipc_sk_bind() function · 60c102ee

Committed by Jon Maloy on Nov 25, 2020

We refactor the tipc_sk_bind() function, so that lock handling
is separated from the logic. We also move some sanity
tests earlier in the call chain, to the function tipc_bind().

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/x25: remove x25_kill_by_device() · 139d6eb1

Committed by Martin Schiller on Nov 26, 2020

Remove the obsolete function x25_kill_by_device(). It's not used any more.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/x25: fix restart request/confirm handling · d023b2b9

Committed by Martin Schiller on Nov 26, 2020

We have to take the actual link state into account to handle
restart requests/confirms correctly.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/lapb: fix t1 timer handling for LAPB_STATE_0 · 62480b99

Committed by Martin Schiller on Nov 26, 2020

1. A DTE interface changes immediately to LAPB_STATE_1 and starts sending
   SABM(E).

2. A DCE interface sends DM N2 times and changes to LAPB_STATE_1
   afterwards if there is no response in the meantime.
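The two behaviours can be modelled as a tiny state machine driven by T1 expiry in LAPB_STATE_0. This is a hypothetical userspace sketch, not the code in net/lapb: the type and function names are illustrative, and it assumes the caller rearms T1 after a DM and starts SABM(E) transmission when the station enters LAPB_STATE_1 (stated above for the DTE, assumed here for the DCE).

```c
#include <assert.h>

enum lapb_sketch_state { SK_LAPB_STATE_0, SK_LAPB_STATE_1 };
enum lapb_sketch_role { SK_DTE, SK_DCE };
enum t1_action { T1_SEND_DM, T1_ENTER_STATE_1 };

/* Hypothetical model of a LAPB station: only the fields needed here. */
struct lapb_sketch {
	enum lapb_sketch_role role;
	enum lapb_sketch_state state;
	int n2;       /* maximum number of retries (N2) */
	int n2count;  /* DM frames sent so far */
};

/* Action taken when T1 fires while the station is in LAPB_STATE_0. */
static enum t1_action lapb_sketch_t1_state0(struct lapb_sketch *l)
{
	if (l->role == SK_DCE && l->n2count < l->n2) {
		l->n2count++;
		return T1_SEND_DM;  /* caller sends DM and rearms T1 */
	}
	/* DTE: immediately. DCE: after N2 DMs without a response. */
	l->n2count = 0;
	l->state = SK_LAPB_STATE_1;
	return T1_ENTER_STATE_1;
}
```

With N2 = 3, a DTE leaves state 0 on the first expiry, while a DCE answers the first three expiries with DM and only moves to LAPB_STATE_1 on the fourth.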

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/lapb: support netdev events · a4989fa9

Committed by Martin Schiller on Nov 26, 2020

This patch allows layer2 (LAPB) to react to netdev events itself and
avoids the detour via layer3 (X.25).

1. Establish layer2 on NETDEV_UP events, if the carrier is already up.

2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
   the peer that the connection will go down.
   (Only when the carrier is up.)

3. When a NETDEV_DOWN event occurs, clear all queues, enter state
   LAPB_STATE_0 and stop all timers.

4. The NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection.

   In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
   and stop all timers.

   In case of Carrier Detection, we start timer t1 on a DCE interface,
   and on a DTE interface we change to state LAPB_STATE_1 and start
   sending SABM(E).
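The event handling in the list above can be sketched as a single switch. This is a hypothetical userspace model, not the LAPB notifier: the EV_* values are local stand-ins for the NETDEV_* events, the struct fields are illustrative, and treating "establish layer2" on NETDEV_UP the same way as carrier detection is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NETDEV_UP, NETDEV_GOING_DOWN,
 * NETDEV_DOWN and NETDEV_CHANGE; not the kernel's values. */
enum nd_event { EV_UP, EV_GOING_DOWN, EV_DOWN, EV_CHANGE };

struct lapb_dev_sketch {
	bool dce;             /* true: DCE, false: DTE */
	bool carrier;         /* updated by the caller before EV_CHANGE */
	int state;            /* 0 = LAPB_STATE_0, 1 = LAPB_STATE_1 */
	bool t1_running;
	bool sending_sabm;
	bool disc_requested;  /* models lapb_disconnect_request() */
	int queued;           /* frames waiting in the LAPB queues */
};

static void sketch_establish(struct lapb_dev_sketch *l)
{
	if (l->dce) {
		l->t1_running = true;    /* DCE: start timer T1 */
	} else {
		l->state = 1;            /* DTE: enter LAPB_STATE_1 ... */
		l->sending_sabm = true;  /* ... and send SABM(E) */
	}
}

static void sketch_reset(struct lapb_dev_sketch *l)
{
	l->queued = 0;           /* clear all queues */
	l->state = 0;            /* back to LAPB_STATE_0 */
	l->t1_running = false;   /* stop all timers */
	l->sending_sabm = false;
}

static void lapb_sketch_netdev_event(struct lapb_dev_sketch *l,
				     enum nd_event ev)
{
	switch (ev) {
	case EV_UP:
		if (l->carrier)
			sketch_establish(l);
		break;
	case EV_GOING_DOWN:
		if (l->carrier)
			l->disc_requested = true;
		break;
	case EV_DOWN:
		sketch_reset(l);
		break;
	case EV_CHANGE:
		if (l->carrier)
			sketch_establish(l);  /* carrier detected */
		else
			sketch_reset(l);      /* carrier lost */
		break;
	}
}
```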

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/x25: handle additional netdev events · 7eed751b

Committed by Martin Schiller on Nov 26, 2020

1. Add / remove x25_link_device on NETDEV_REGISTER/UNREGISTER and also
   on NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE.

   This change is needed so that the x25_neigh struct for an interface
   is already created when the interface shows up, and is kept regardless
   of whether the interface goes UP or DOWN.

   This is used in an upcoming commit, where the x25 params of a neighbour
   will become configurable through ioctls.

2. The NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection. If the carrier is lost, clean up everything related to this
   neighbour by calling x25_link_terminated().

3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the
   call to x25_clear_forward_by_dev() in x25_route_device_down(), as
   this is already called by x25_kill_by_neigh(), which gets called by
   x25_link_terminated().

4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will
   be handled in layer 2 (LAPB), and layer 3 (X.25) will be informed by
   layer 2 when the layer 2 link is established and the layer 3 link
   should be initiated.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_frag: add generic packet fragment support. · c129412f

Committed by wenxu on Nov 25, 2020

Currently the kernel tc subsystem can do conntrack in act_ct. But when
several fragmented packets go through act_ct, the function
tcf_ct_handle_fragments() defragments them into one big packet. If the
last action then redirects that packet via mirred to another device,
the reassembled packet may exceed the MTU of the target device.

This patch adds support for an xmit hook in mirred that gets executed
before transmitting the packet. When act_ct gets loaded, it configures
that hook. The frag xmit hook may be reused by other modules.
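The hook mechanism can be sketched in userspace as a function pointer that mirred consults right before transmitting. All names here (xmit_hook, frag_xmit_hook(), struct pkt_sketch) are hypothetical, not the sch_frag API; the sketch ignores headers and simply splits the payload into MTU-sized pieces when the packet was defragmented (mru != 0).

```c
#include <assert.h>

/* Hypothetical packet model: only what the hook needs. */
struct pkt_sketch {
	unsigned int len;  /* bytes to transmit */
	unsigned int mru;  /* non-zero only if act_ct defragmented it */
};

typedef int (*xmit_fn)(struct pkt_sketch *p, unsigned int mtu);
typedef int (*xmit_hook_fn)(struct pkt_sketch *p, unsigned int mtu,
			    xmit_fn xmit);

static xmit_hook_fn xmit_hook;    /* NULL until a module registers one */
static unsigned int frames_sent;

/* Device transmit: refuses anything larger than the MTU. */
static int dev_xmit_sketch(struct pkt_sketch *p, unsigned int mtu)
{
	if (p->len > mtu)
		return -1;
	frames_sent++;
	return 0;
}

/* Frag hook: split a reassembled packet into MTU-sized pieces. */
static int frag_xmit_hook(struct pkt_sketch *p, unsigned int mtu,
			  xmit_fn xmit)
{
	struct pkt_sketch frag = { 0, 0 };
	unsigned int left;

	if (!p->mru || p->len <= mtu)
		return xmit(p, mtu);

	for (left = p->len; left > 0; left -= frag.len) {
		frag.len = left < mtu ? left : mtu;
		if (xmit(&frag, mtu))
			return -1;
	}
	return 0;
}

/* mirred: run the hook, if any, right before transmitting. */
static int mirred_xmit_sketch(struct pkt_sketch *p, unsigned int mtu)
{
	if (xmit_hook)
		return xmit_hook(p, mtu, dev_xmit_sketch);
	return dev_xmit_sketch(p, mtu);
}
```

Without a registered hook, a reassembled 3000-byte packet is refused by a 1500-byte MTU device; once `xmit_hook = frag_xmit_hook` is set (as act_ct would do on load), the same packet goes out as two frames.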

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: act_mirred: refactor the handle of xmit · fa6d6399

Committed by wenxu on Nov 25, 2020

This one is a preparation for the next patch.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: fix miss init the mru in qdisc_skb_cb · aadaca9e

Committed by wenxu on Nov 25, 2020

The mru in the qdisc_skb_cb should be initialized to 0. Only packets
defragmented in act_ct will set the value.

Fixes: 038ebb1a ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tls: add CHACHA20-POLY1305 configuration · 74ea6106

Committed by Vadim Fedorenko on Nov 24, 2020

Add ChaCha-Poly specific configuration code.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tls: add CHACHA20-POLY1305 specific behavior · a6acbe62

Committed by Vadim Fedorenko on Nov 24, 2020

RFC 7905 defines special behavior for ChaCha-Poly TLS sessions.
The differences are in the calculation of the nonce and the absence
of an explicit IV. This behavior is partly similar to TLS 1.3.
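For reference, RFC 7905 builds the per-record nonce from the 12-byte write IV and the 64-bit record sequence number: the sequence number is encoded big-endian, left-padded with four zero bytes, and XORed into the IV, so no explicit nonce travels in the record. The helper below is a hypothetical userspace sketch of that computation, not the kernel's tls code.

```c
#include <assert.h>
#include <stdint.h>

#define CHACHA_IV_LEN 12

/* nonce = iv XOR (0x00000000 || big-endian 64-bit seq) */
static void chacha_poly_nonce(const uint8_t iv[CHACHA_IV_LEN], uint64_t seq,
			      uint8_t nonce[CHACHA_IV_LEN])
{
	int i;

	for (i = 0; i < CHACHA_IV_LEN; i++)
		nonce[i] = iv[i];
	/* Bytes 0..3 are XORed with the zero padding, i.e. unchanged.
	 * The least significant byte of seq lands in the last byte. */
	for (i = 0; i < 8; i++)
		nonce[CHACHA_IV_LEN - 1 - i] ^= (uint8_t)(seq >> (8 * i));
}
```

By contrast, AES-GCM cipher suites in TLS 1.2 (RFC 5288) carry an 8-byte explicit nonce in every record.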

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tls: make inline helpers protocol-aware · 6942a284

Committed by Vadim Fedorenko on Nov 24, 2020

Inline functions defined in tls.h have a lot of AES-specific
constants. Remove these constants and change the argument to struct
tls_prot_info, so that later patches can access the cipher type.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sock: set sk_err to ee_errno on dequeue from errq · 985f7337

Committed by Willem de Bruijn on Nov 26, 2020

When setting sk_err, set it to ee_errno, not ee_origin.

Commit f5f99309 ("sock: do not set sk_err in
sock_dequeue_err_skb") disabled updating sk_err on errq dequeue,
which is correct for most error types (origins):

  -       sk->sk_err = err;

Commit 38b25793 ("sock: reset sk_err when the error queue is
empty") re-enabled the behavior for ICMP origins, which do require it:

  +       if (icmp_next)
  +               sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;

But it should read from ee_errno.
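To make the one-field mix-up concrete, here is a hypothetical userspace model of the decision. struct ee_sketch mirrors only two fields of struct sock_extended_err, and the origin values are those of SO_EE_ORIGIN_ICMP and SO_EE_ORIGIN_ICMP6; leaving sk_err unchanged for other origins is a simplification of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values of SO_EE_ORIGIN_ICMP / SO_EE_ORIGIN_ICMP6 (linux/errqueue.h). */
#define SKETCH_ORIGIN_ICMP  2
#define SKETCH_ORIGIN_ICMP6 3

/* Mirrors two fields of struct sock_extended_err. */
struct ee_sketch {
	uint32_t ee_errno;   /* the error number */
	uint8_t  ee_origin;  /* where the error came from */
};

/* sk_err after dequeuing one error, given the next queued entry. */
static int sk_err_after_dequeue(int cur_err, const struct ee_sketch *next)
{
	int icmp_next = next && (next->ee_origin == SKETCH_ORIGIN_ICMP ||
				 next->ee_origin == SKETCH_ORIGIN_ICMP6);

	if (icmp_next)
		return (int)next->ee_errno;  /* fixed: was next->ee_origin */
	return cur_err;                      /* sketch: otherwise unchanged */
}
```

With the buggy assignment, a queued ICMP error would have set sk_err to 2 (the origin code) instead of its errno.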

Fixes: 38b25793 ("sock: reset sk_err when the error queue is empty")
Reported-by: Ayush Ranjan <ayushranjan@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix NULL ptr dereference on bad MPJ · d3ab7885

Committed by Paolo Abeni on Nov 26, 2020

If an msk listener receives an MPJ carrying an invalid token, it
will zero the request socket msk entry. That should later
cause fallback and subflow reset - as per RFC - at
subflow_syn_recv_sock() time due to failing hmac validation.

Since commit 4cf8b7e4 ("subflow: introduce and use
mptcp_can_accept_new_subflow()"), we unconditionally dereference
- in mptcp_can_accept_new_subflow - the subflow request msk
before performing hmac validation. In the above scenario we
hit a NULL ptr dereference.

Address the issue by doing the hmac validation earlier.

Fixes: 4cf8b7e4 ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: openvswitch: fix TTL decrement action netlink message format · 69929d4c

Committed by Eelco Chaudron on Nov 24, 2020

Currently, the openvswitch module does not accept the correctly formatted
netlink message for the TTL decrement action. For both setting and getting
the dec_ttl action, the actions should be nested in the
OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi.

When the original patch was sent, it was tested with a private OVS userspace
implementation. This implementation was unfortunately not upstreamed and
reviewed, hence an erroneous version of this patch was sent out.

Leaving the patch as-is would cause problems, as the kernel module could
interpret additional attributes as actions and vice versa, due to the
actions not being encapsulated/nested within the actual attribute, but
being concatenated after it.

Fixes: 744676e7 ("openvswitch: add TTL decrement action")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

27 Nov, 2020 · 1 commit

can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check · d73ff9b7

Committed by Oliver Hartkopp on Nov 26, 2020

To detect potential bugs in CAN protocol implementations (double removal of
receiver entries) a WARN() statement has been used if no matching list item was
found for removal.

The fault injection issued by syzkaller was able to create a situation where
the closing of a socket runs simultaneously to the notifier call chain for
removing the CAN network device in use.

This case is very unlikely in real life but it doesn't break anything.
Therefore we just replace the WARN() statement with pr_warn() to preserve the
notification for the CAN protocol development.

Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com
Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com
Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

26 Nov, 2020 · 10 commits

net/tls: Protect from calling tls_dev_del for TLS RX twice · 025cc2fb

Committed by Maxim Mikityanskiy on Nov 25, 2020

tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLS TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. This leaves a
window in which tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.

This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del a second time.

Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Make sure devlink instance and port are in same net namespace · a7b43649

Committed by Parav Pandit on Nov 25, 2020

When the devlink reload operation is not used, the netdev of an Ethernet
port may be in a different net namespace than the net namespace of the
devlink instance.

Ensure that both the devlink instance and the devlink port netdev are
located in the same net namespace.

Fixes: 070c63f2 ("net: devlink: allow to change namespaces during reload")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Hold rtnl lock while reading netdev attributes · b187c9b4

Committed by Parav Pandit on Nov 25, 2020

A netdevice of a devlink port can be moved to a different net namespace
than its parent devlink instance.
This scenario occurs when devlink reload is not used.

When a netdevice is migrating to another net namespace, its ifindex
and name may change.

In such a case, a devlink port query may read stale netdev attributes.

Fix it by reading them under the rtnl lock.

Fixes: bfcd3a46 ("Introduce devlink infrastructure")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Use lockdep_assert_in_softirq() in napi_consume_skb() · 6454eca8

Committed by Yunsheng Lin on Nov 24, 2020

Use lockdep_assert_in_softirq() in napi_consume_skb() to assert the
case when it is not called in an atomic softirq context.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: be careful on MPTCP-level ack. · fd897679

Committed by Paolo Abeni on Nov 24, 2020

We can enter the main mptcp_recvmsg() loop even when
no subflows are connected. As noted by Eric, that would
result in a divide-by-zero oops on ack generation.

Address the issue by checking the subflow status before
sending the ack.

Additionally, protect mptcp_recvmsg() against invocation
with weird socket states.

v1 -> v2:
 - removed unneeded inline keyword - Jakub

Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: ea4ca586 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/5370c0ae03449239e3d1674ddcfb090cf6f20abe.1606253206.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mrp: Implement LC mode for MRP · bfd04232

Committed by Horatiu Vultur on Nov 24, 2020

Extend MRP to support LC mode (link check) for the interconnect port.
This applies only to the interconnect ring.

Unlike RC mode (ring check), LC mode uses CFM frames to
detect when the link goes up or down, and based on that the userspace
will need to react.
One advantage of LC mode over RC mode is that there will be fewer
frames in the normal rings, because RC mode generates InTest frames on
all ports while LC mode sends CFM frames only on the interconnect port.

All 4 nodes that are part of the interconnect ring need to use the same
mode, and it is not possible to run LC and RC mode at the same time
on a node.

Whenever the MIM starts, it needs to detect the status of the other 3
nodes in the interconnect ring, so it sends a frame called
InLinkStatus, to which the clients need to reply with their link
status.

This patch adds the InLinkStatus frame type and extends the existing
rules on how to forward this frame.

Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: alias action flags with TCA_ACT_ prefix · f460019b

Committed by Vlad Buslov on Nov 24, 2020

Currently both filter and action flags use the same "TCA_" prefix, which
makes them hard to distinguish in code and confusing for users. Create
aliases for the existing action flag constants with the "TCA_ACT_" prefix.

Signed-off-by: Vlad Buslov <vlad@buslov.dev>
Link: https://lore.kernel.org/r/20201124164054.893168-1-vlad@buslov.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: put reference in mptcp timeout timer · b6d69fc8

Committed by Florian Westphal on Nov 24, 2020

On close this timer might be scheduled. mptcp uses sk_reset_timer for
this, so a reference on the mptcp socket is taken.

This causes a refcount leak which can, for example, be reproduced
with 'mp_join_server_v4.pkt' from the mptcp-packetdrill repo.

The leak has nothing to do with join requests: v1_mp_capable_bind_no_cs.pkt
reproduces it too when the last ack's mpcapable is changed from v0 to v1.

unreferenced object 0xffff888109bba040 (size 2744):
  comm "packetdrill", [..]
  backtrace:
    [..] sk_prot_alloc.isra.0+0x2b/0xc0
    [..] sk_clone_lock+0x2f/0x740
    [..] mptcp_sk_clone+0x33/0x1a0
    [..] subflow_syn_recv_sock+0x2b1/0x690 [..]

Fixes: e16163b6 ("mptcp: refactor shutdown and close")
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20201124162446.11448-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gro_cells: reduce number of synchronize_net() calls · 2543a600

Committed by Eric Dumazet on Nov 24, 2020

After cited commit, gro_cells_destroy() became damn slow
on hosts with a lot of cores.

This is because we have one additional synchronize_net() per cpu as
stated in the changelog.

gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
to avoid one synchronize_net() call per netif_napi_del().

We can factorize all the synchronize_net() to a single one,
right before freeing per-cpu memory.
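A minimal model of the change: the expensive wait moves out of the per-CPU loop, so teardown cost no longer scales with the number of cores. The names below (wait_grace_period(), struct cell_sketch, gro_cells_destroy_sketch()) are hypothetical stand-ins for synchronize_net() and the per-cpu gro cells, not kernel code.

```c
#include <assert.h>
#include <stdlib.h>

static unsigned int grace_periods;  /* counts synchronize_net()-like waits */

/* Hypothetical stand-in for synchronize_net(): an expensive RCU wait. */
static void wait_grace_period(void)
{
	grace_periods++;
}

struct cell_sketch {
	int napi_active;
};

/* Delete every per-cpu NAPI instance without waiting, then wait once
 * before freeing the per-cpu memory. */
static void gro_cells_destroy_sketch(struct cell_sketch *cells, int ncpus)
{
	int i;

	for (i = 0; i < ncpus; i++)
		cells[i].napi_active = 0;  /* unlink only, no per-cpu wait */

	wait_grace_period();               /* one wait covers every CPU */
	free(cells);
}
```

On a 64-CPU host this performs one grace-period wait instead of 64.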

Fixes: 5198d545 ("net: remove napi_hash_del() from driver-facing API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init · e255e11e

Committed by Wang Hai on Nov 24, 2020

kmemleak reports a memory leak as follows:

unreferenced object 0xffff8880059c6a00 (size 64):
  comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
  hex dump (first 32 bytes):
    20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00   ...............
    1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00  ................
  backtrace:
    [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0
    [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170
    [<000000006a9ca9d4>] ops_init+0xa8/0x3c0
    [<000000002da57bf2>] setup_net+0x2de/0x7e0
    [<000000004e52d573>] copy_net_ns+0x27d/0x530
    [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30
    [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0
    [<0000000030653721>] ksys_unshare+0x3a4/0x780
    [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40
    [<0000000031a10c08>] do_syscall_64+0x33/0x40
    [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

We should free all rules when we catch an error in ip6addrlbl_net_init(),
otherwise a memory leak will occur.
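The fix follows a common error-path pattern: if adding any rule fails, free the rules that were already added before returning. A hypothetical userspace sketch; struct label_rule, the live_rules counter and the fail_at injection parameter are illustrative and not part of net/ipv6/addrlabel.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct label_rule {
	struct label_rule *next;
	int label;
};

static unsigned int live_rules;  /* allocations not yet freed */

static void rules_free_all(struct label_rule **head)
{
	while (*head) {
		struct label_rule *r = *head;

		*head = r->next;
		free(r);
		live_rules--;
	}
}

/* fail_at: index of the rule whose insertion is forced to fail
 * (-1 = no injected failure). */
static int addrlbl_net_init_sketch(struct label_rule **head, int nrules,
				   int fail_at)
{
	int i;

	for (i = 0; i < nrules; i++) {
		struct label_rule *r = i == fail_at ? NULL : malloc(sizeof(*r));

		if (!r) {
			rules_free_all(head);  /* the fix: drop earlier rules */
			return -1;
		}
		r->label = i;
		r->next = *head;
		*head = r;
		live_rules++;
	}
	return 0;
}
```

Without the rules_free_all() call on the error path, a failure at rule 3 would leave rules 0 to 2 allocated, which is the kind of object kmemleak reports above.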

Fixes: 2a8cc6c8 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

25 Nov, 2020 · 4 commits

tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7

Committed by Alexander Duyck on Nov 20, 2020

When a BPF program is used to select between TCP congestion control
algorithms that either use ECN or not, there is a case where the
SYN-ACK was sent without the ECT0 bit set. A bit of research found
that this was due to the final socket being configured to dctcp while
the listener socket was staying in cubic.

To reproduce it, all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming the tcp_dctcp module is loaded or compiled in and the traffic
matches the rules in the sample file, is that the ECT0 bit is set on all
frames except the SYN-ACK.

To address that, it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn() using the request socket and then use the output of
that to set the ECT0 bit in the tos/tclass of the packet.
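The ECN field is the low two bits of the IPv4 TOS / IPv6 Traffic Class byte, and ECT(0) is the codepoint 0b10 (RFC 3168). A hypothetical sketch of the extra SYN-ACK step, where ca_needs_ecn stands in for the result of tcp_bpf_ca_needs_ecn() on the request socket; the helper name and the "only when no ECN codepoint is set yet" condition are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define ECN_MASK  0x03  /* low two bits of TOS / Traffic Class */
#define ECN_ECT_0 0x02  /* ECT(0) codepoint (RFC 3168) */

/* TOS/tclass to use for a SYN-ACK. */
static uint8_t synack_tos_sketch(uint8_t tos, int ca_needs_ecn)
{
	if ((tos & ECN_MASK) == 0 && ca_needs_ecn)
		tos |= ECN_ECT_0;  /* keep the DSCP bits, mark ECT(0) */
	return tos;
}
```

For example, a listener TOS of 0xb8 (DSCP EF) becomes 0xba when the congestion control chosen for the request needs ECN, and stays 0xb8 otherwise.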

Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: warn if gso_type isn't set for a GSO SKB · 1d155dfd

Committed by Heiner Kallweit on Nov 21, 2020

In bug report [0], a warning in the r8169 driver was reported that was
caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion
about this issue. The origin of the invalid GSO SKB is still unclear.

It shouldn't be a network driver's task to check for invalid GSO SKBs.
Also, even if issue [0] can be fixed, we can't be sure that a
similar issue doesn't pop up again at another place.
Therefore, let gso_features_check() check for such invalid GSO SKBs.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=209423
[1] https://www.spinics.net/lists/netdev/msg690794.html

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Fix reload stats structure · 5204bb68

Committed by Moshe Shemesh on Nov 23, 2020

Fix the reload stats structure exposed to the user. Change the stats
structure hierarchy so that the reload action is the parent of the stat
entry, and the stat entry includes a value per limit. This will also help
to avoid string concatenation in the iproute2 output.

Reload stats structure before this fix:
"stats": {
    "reload": {
        "driver_reinit": 2,
        "fw_activate": 1,
        "fw_activate_no_reset": 0
    }
}

After this fix:
"stats": {
    "reload": {
        "driver_reinit": {
            "unspecified": 2
        },
        "fw_activate": {
            "unspecified": 1,
            "no_reset": 0
        }
    }
}

Fixes: a254c264 ("devlink: Add reload stats")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Add blackhole_nexthop trap · f0a5013e

Committed by Ido Schimmel on Nov 23, 2020

Add a packet trap to report packets that were dropped due to a
blackhole nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

24 Nov, 2020 · 2 commits

sctp: Fix some typo · 5112cf59

Committed by Christophe JAILLET on Nov 22, 2020

s/tranport/transport/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201122180704.1366636-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/packet: fix packet receive on L3 devices without visible hard header · d5496990

Committed by Eyal Birger on Nov 21, 2020

In the patchset merged by commit b9fcf0a0
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.
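A minimal model of the predicate described above: a device counts as having a visible hard header only if it has header_ops and those ops provide .create(). The mock structures keep only the members the check needs, and all names are illustrative; consult netdevice.h for the real dev_has_header() definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal mocks: only the members the predicate looks at. */
struct header_ops_sketch {
	int (*create)(void *skb);  /* builds a hard header, if supported */
};

struct net_device_sketch {
	const struct header_ops_sketch *header_ops;
};

/* True only when upper layers could build a hard header for this device. */
static bool dev_has_header_sketch(const struct net_device_sketch *dev)
{
	return dev->header_ops && dev->header_ops->create;
}

static int eth_create_sketch(void *skb)
{
	(void)skb;
	return 14;  /* pretend to write a 14-byte Ethernet header */
}
```

An L3 device that was only given header_ops for protocol parsing (no .create()) is therefore treated as header-less on the receive path, just like a device without header_ops at all.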

Fixes: b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
