Commits · 42bfba9eaa33dd4af0b50b87508062a41ec26653 · openeuler / Kernel

16 Nov, 2019 · 3 commits

net/smc: immediate termination for SMCD link groups · 42bfba9e

Committed by Ursula Braun on Nov 14, 2019

SMCD link group termination is called when peer signals its shutdown
of its corresponding link group. For regular shutdowns no connections
exist anymore. For abnormal shutdowns connections must be killed and
their DMBs must be unregistered immediately. That means the SMCR method
to delay the link group freeing several seconds does not fit.

This patch adds immediate termination of a link group and its SMCD
connections and makes sure all SMCD link group related cleanup steps
are finished.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: fix final cleanup sequence for SMCD devices · 50c6b20e

Committed by Ursula Braun on Nov 14, 2019

If peer announces shutdown, use the link group terminate worker for
local cleanup of link groups and connections to terminate link group
in proper context.

Make sure link groups are cleaned up first before destroying the
event queue of the SMCD device, because link group cleanup may
raise events.

Send signal shutdown only if peer has not done it already.

Send socket abort or close only if the peer has not already announced
shutdown.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tls: Fix unused function warning · d6649d78

Committed by YueHaibing on Nov 14, 2019

If PROC_FS is not set, gcc warns:

net/tls/tls_proc.c:23:12: warning:
 'tls_statistics_seq_show' defined but not used [-Wunused-function]

Use #ifdef to guard this.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
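
The guard described in the net/tls commit above can be sketched as follows. This is a minimal illustration, not the actual patch: DEMO_PROC_FS is a hypothetical stand-in for the kernel's PROC_FS config symbol, and both function names are invented for the example.

```c
#include <stdio.h>

/* DEMO_PROC_FS is a hypothetical stand-in for the PROC_FS config symbol. */
#ifdef DEMO_PROC_FS
/* Compiled only when its sole caller below is compiled as well, so the
 * compiler never sees a static function that is defined but unused. */
static int demo_statistics_show(void)
{
    puts("stats");
    return 0;
}
#endif

int demo_proc_init(void)
{
#ifdef DEMO_PROC_FS
    return demo_statistics_show();
#else
    return 0; /* nothing to register without procfs */
#endif
}
```

Building once with and once without -DDEMO_PROC_FS (together with -Wunused-function) is a quick way to check that the guard covers both configurations.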

15 Nov, 2019 · 15 commits

vsock: fix bind() behaviour taking care of CID · 36c5b48b

Committed by Stefano Garzarella on Nov 14, 2019

When we are looking for a socket bound to a specific address,
we also have to take into account the CID.

This patch is useful with multi-transports support because it
allows the binding of the same port with different CID, and
it prevents a connection to a wrong socket bound to the same
port, but with different CID.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
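
A rough sketch of a bound-socket lookup that takes the CID into account, as the vsock bind() commit above describes. The struct, the function names, and the treatment of a wildcard CID are assumptions made for illustration; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CID_ANY UINT32_MAX /* hypothetical stand-in for VMADDR_CID_ANY */

struct demo_addr {
    uint32_t cid;
    uint32_t port;
};

/* A bound address matches only if the port is equal and the CIDs agree.
 * Assumption: a wildcard CID on either side agrees with any CID. */
static bool demo_addr_matches(const struct demo_addr *want,
                              const struct demo_addr *bound)
{
    if (want->port != bound->port)
        return false;
    return want->cid == bound->cid ||
           want->cid == DEMO_CID_ANY ||
           bound->cid == DEMO_CID_ANY;
}

/* Return the first bound address matching 'want', or NULL if none does. */
const struct demo_addr *demo_find_bound(const struct demo_addr *table,
                                        size_t n,
                                        const struct demo_addr *want)
{
    for (size_t i = 0; i < n; i++)
        if (demo_addr_matches(want, &table[i]))
            return &table[i];
    return NULL;
}
```

With two entries bound to the same port but different CIDs, a lookup for one CID returns only that entry, and a lookup for a third CID returns NULL instead of a socket that merely shares the port.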

vsock: prevent transport modules unloading · 6a2c0962

Committed by Stefano Garzarella on Nov 14, 2019

This patch adds a 'module' member to 'struct vsock_transport'
in order to get/put the transport module. This prevents the
module from being unloaded while sockets are assigned to it.

We increase the module refcnt when a socket is assigned to a
transport, and we decrease the module refcnt when the socket
is destroyed.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
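
The get/put pattern from the transport-module commit above, modeled in userspace with a plain counter. In the kernel the reference would normally be taken and dropped with try_module_get() and module_put(); every type and function below is a hypothetical simplification.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_module {
    int refcnt;      /* number of sockets currently using the module */
    bool unloading;  /* set once unloading has started */
};

struct demo_transport {
    struct demo_module *module;
};

struct demo_sock {
    const struct demo_transport *transport;
};

/* Take a module reference when a socket is assigned to a transport.
 * Fails if the module is already going away. */
bool demo_assign_transport(struct demo_sock *sk,
                           const struct demo_transport *t)
{
    if (t->module->unloading)
        return false;
    t->module->refcnt++;
    sk->transport = t;
    return true;
}

/* Drop the reference when the socket is destroyed. */
void demo_destruct(struct demo_sock *sk)
{
    if (sk->transport) {
        sk->transport->module->refcnt--;
        sk->transport = NULL;
    }
}

/* The module may only be unloaded when no socket holds a reference. */
bool demo_can_unload(const struct demo_module *m)
{
    return m->refcnt == 0;
}
```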

vsock/vmci: register vmci_transport only when VMCI guest/host are active · b1bba80a

Committed by Stefano Garzarella on Nov 14, 2019

To allow other transports to be loaded with vmci_transport,
we register the vmci_transport as G2H or H2G only when a VMCI guest
or host is active.

To do that, this patch adds a callback registered in the vmci driver
that will be called when the host or guest becomes active.
This callback will register the vmci_transport in the VSOCK core.

Cc: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: add multi-transports support · c0cfa2d8

Committed by Stefano Garzarella on Nov 14, 2019

This patch adds the support of multiple transports in the
VSOCK core.

With the multi-transports support, we can use vsock with nested VMs
(using also different hypervisors) loading both guest->host and
host->guest transports at the same time.

Major changes:
- vsock core module can be loaded regardless of the transports
- vsock_core_init() and vsock_core_exit() are renamed to
  vsock_core_register() and vsock_core_unregister()
- vsock_core_register() has a feature parameter (H2G, G2H, DGRAM)
  to identify which directions the transport can handle and whether it
  supports DGRAM (only vmci)
- each stream socket is assigned to a transport when the remote CID
  is set (during the connect() or when we receive a connection request
  on a listener socket).
  The remote CID is used to decide which transport to use:
  - remote CID <= VMADDR_CID_HOST will use guest->host transport;
  - remote CID == local_cid (guest->host transport) will use guest->host
    transport for loopback (host->guest transports don't support loopback);
  - remote CID > VMADDR_CID_HOST will use host->guest transport;
- listener sockets are not bound to any transport since no transport
  operations are done on them. In this way we can create a listener
  socket even if the transports are not loaded, or with VMADDR_CID_ANY
  to listen on all transports.
- DGRAM sockets are handled as before, since only the vmci_transport
  provides this feature.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
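
The CID-based transport selection listed in the multi-transports commit above can be sketched like this. The enum, the function, and its parameters are hypothetical; only the three rules and the host CID value of 2 (VMADDR_CID_HOST) come from the commit message and the uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CID_HOST 2u /* same value as VMADDR_CID_HOST */

enum demo_transport {
    DEMO_NONE, /* no suitable transport is loaded */
    DEMO_G2H,  /* guest->host transport */
    DEMO_H2G   /* host->guest transport */
};

/* Pick a transport for a stream socket from its remote CID.
 * 'g2h_local_cid' is the local CID reported by the guest->host transport. */
enum demo_transport demo_pick_transport(uint32_t remote_cid,
                                        uint32_t g2h_local_cid,
                                        bool g2h_loaded,
                                        bool h2g_loaded)
{
    /* Remote CID <= host CID: talk to the hypervisor/host. */
    if (remote_cid <= DEMO_CID_HOST)
        return g2h_loaded ? DEMO_G2H : DEMO_NONE;

    /* Loopback goes through guest->host, since host->guest
     * transports do not support it. */
    if (g2h_loaded && remote_cid == g2h_local_cid)
        return DEMO_G2H;

    /* Anything else is one of our guests. */
    return h2g_loaded ? DEMO_H2G : DEMO_NONE;
}
```

Returning DEMO_NONE when the needed transport is not loaded is a choice made for this sketch; the commit message does not say how that case is reported.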

hv_sock: set VMADDR_CID_HOST in the hvs_remote_addr_init() · 03964257

Committed by Stefano Garzarella on Nov 14, 2019

Remote peer is always the host, so we set VMADDR_CID_HOST as
remote CID instead of VMADDR_CID_ANY.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: move vsock_insert_unbound() in the vsock_create() · 55f3e149

Committed by Stefano Garzarella on Nov 14, 2019

vsock_insert_unbound() was called only when 'sock' parameter of
__vsock_create() was not null. This only happened when
__vsock_create() was called by vsock_create().

In order to simplify the multi-transports support, this patch
moves vsock_insert_unbound() to the end of vsock_create().
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: add vsock_create_connected() called by transports · b9ca2f5f

Committed by Stefano Garzarella on Nov 14, 2019

All transports call __vsock_create() with the same parameters,
most of them depending on the parent socket. In order to simplify
the VSOCK core APIs exposed to the transports, this patch adds
the vsock_create_connected() callable from transports to create
a new socket when a connection request is received.
We also unexported the __vsock_create().
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: handle buffer_size sockopts in the core · b9f2b0ff

Committed by Stefano Garzarella on Nov 14, 2019

virtio_transport and vmci_transport handle the buffer_size
sockopts in a very similar way.

In order to support multiple transports, this patch moves this
handling into the core to allow the user to change the options
even if the socket is not yet assigned to any transport.

This patch also adds the '.notify_buffer_size' callback in the
'struct virtio_transport' in order to inform the transport,
when the buffer_size is changed by the user. It is also useful
to limit the 'buffer_size' requested (e.g. virtio transports).
Acked-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: add 'struct vsock_sock *' param to vsock_core_get_transport() · daabfbca

Committed by Stefano Garzarella on Nov 14, 2019

Since now the 'struct vsock_sock' object contains a pointer to
the transport, this patch adds a parameter to the
vsock_core_get_transport() to return the right transport
assigned to the socket.

This patch also modifies virtio_transport_get_ops(), which
uses vsock_core_get_transport(), adding the
'struct vsock_sock *' parameter.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock() · 4c7246dc

Committed by Stefano Garzarella on Nov 14, 2019

We are going to add 'struct vsock_sock *' parameter to
virtio_transport_get_ops().

In some cases, like in the virtio_transport_reset_no_sock(),
we don't have any socket assigned to the packet received,
so we can't use the virtio_transport_get_ops().

In order to allow virtio_transport_reset_no_sock() to use the
'.send_pkt' callback from the 'vhost_transport' or 'virtio_transport',
we add the 'struct virtio_transport *' to it and to its caller:
virtio_transport_recv_pkt().

We moved the 'vhost_transport' and 'virtio_transport' definition,
to pass their address to the virtio_transport_recv_pkt().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: add 'transport' member in the struct vsock_sock · fe502c4a

Committed by Stefano Garzarella on Nov 14, 2019

In preparation for supporting multiple transports, this patch adds
the 'transport' member to 'struct vsock_sock'.
This new field is initialized during creation in the
__vsock_create() function.

This patch also renames the global 'transport' pointer to
'transport_single', since for now we're only supporting a single
transport registered at run-time.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: remove include/linux/vm_sockets.h file · 3603a2e9

Committed by Stefano Garzarella on Nov 14, 2019

This header file now only includes "uapi/linux/vm_sockets.h".
We can include it directly when needed.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: remove vm_sockets_get_local_cid() · db205c76

Committed by Stefano Garzarella on Nov 14, 2019

vm_sockets_get_local_cid() is only used in virtio_transport_common.c.
We can replace it by calling virtio_transport_get_ops() and
using the get_local_cid() callback registered by the transport.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock/vmci: remove unused VSOCK_DEFAULT_CONNECT_TIMEOUT · 7ed78bc4

Committed by Stefano Garzarella on Nov 14, 2019

The VSOCK_DEFAULT_CONNECT_TIMEOUT definition was introduced with
commit d021c344 ("VSOCK: Introduce VM Sockets"), but it is
never used in the net/vmw_vsock/vmci_transport.c.

VSOCK_DEFAULT_CONNECT_TIMEOUT is used and defined in
net/vmw_vsock/af_vsock.c

Cc: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: openvswitch: add hash info to upcall · bd1903b7

Committed by Tonghao Zhang on Nov 13, 2019

When using the kernel datapath, the upcall does not
include the related skb hash info. That introduces
some problems, because the skb hash is important
in the kernel stack. For example, the VXLAN module uses
it to select the UDP src port. The tx queue selection
may also use the hash in the stack.

The hash is computed in different ways. It is random
for a TCP socket, and it may be computed in hardware
or in the software stack. Recalculating the hash is not easy.

Hash of TCP socket is computed:
tcp_v4_connect
    -> sk_set_txhash (is random)

__tcp_transmit_skb
    -> skb_set_hash_from_sk

There will be one upcall, without skb hash information,
to ovs-vswitchd for the first packet of a TCP
session. The remaining packets will be processed in the Open vSwitch
modules with the hash kept. If this TCP session is forwarded to the
VXLAN module, then the UDP src port of the first TCP packet
differs from that of the remaining packets.

TCP packets may come from the host or from Docker containers to Open vSwitch.
To fix it, we store the hash info in the upcall, and restore the hash
when packets are sent back.

+---------------+          +-------------------------+
|   Docker/VMs  |          |     ovs-vswitchd        |
+----+----------+          +-+--------------------+--+
     |                       ^                    |
     |                       |                    |
     |                       |  upcall            v restore packet hash (not recalculate)
     |                     +-+--------------------+--+
     |  tap netdev         |                         |   vxlan module
     +--------------->     +-->  Open vSwitch ko     +-->
       or internal type    |                         |
                           +-------------------------+

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
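
The store-and-restore idea from the openvswitch commit above, reduced to a sketch. The structs and functions are hypothetical and do not reflect the real upcall layout; the point is only that the hash travels with the upcall and is copied back rather than recomputed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified packet and upcall records. */
struct demo_skb {
    uint32_t hash;
    bool hash_set;
};

struct demo_upcall {
    uint32_t hash;
    bool has_hash;
};

/* Kernel -> userspace: remember the packet hash in the upcall. */
void demo_upcall_store_hash(struct demo_upcall *up,
                            const struct demo_skb *skb)
{
    up->has_hash = skb->hash_set;
    up->hash = skb->hash_set ? skb->hash : 0;
}

/* Userspace -> kernel: restore the original hash instead of recalculating,
 * so the first packet of a flow keeps the same hash as the later ones. */
void demo_execute_restore_hash(struct demo_skb *skb,
                               const struct demo_upcall *up)
{
    if (up->has_hash) {
        skb->hash = up->hash;
        skb->hash_set = true;
    }
}
```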

13 Nov, 2019 · 10 commits

bridge: implement get_link_ksettings ethtool method · 542575fe

Committed by Matthias Schiffer on Nov 12, 2019

We return the maximum speed of all active ports. This matches how the link
speed would give an upper limit for traffic to/from any single peer if the
bridge were replaced with a hardware switch.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
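
Reporting the maximum speed over all active ports, as the bridge commit above describes, could look roughly like this. The port struct and DEMO_SPEED_UNKNOWN are assumptions for the example and are not the kernel's ethtool types.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SPEED_UNKNOWN (-1) /* hypothetical "no usable port" marker */

struct demo_port {
    bool active;     /* port is up and forwarding */
    int speed_mbps;  /* link speed, or DEMO_SPEED_UNKNOWN */
};

/* Return the highest speed among active ports, or DEMO_SPEED_UNKNOWN
 * if no active port reports a speed. */
int demo_bridge_speed(const struct demo_port *ports, size_t n)
{
    int best = DEMO_SPEED_UNKNOWN;

    for (size_t i = 0; i < n; i++) {
        if (!ports[i].active)
            continue;
        if (ports[i].speed_mbps > best)
            best = ports[i].speed_mbps;
    }
    return best;
}
```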

net: dsa: Prevent usage of NET_DSA_TAG_8021Q as tagging protocol · 129bd7ca

Committed by Florian Fainelli on Nov 11, 2019

It is possible for a switch driver to use NET_DSA_TAG_8021Q as a valid
DSA tagging protocol since it registers itself as such. Unfortunately,
since no xmit or rcv functions are provided, the lack of an xmit()
function will lead to a NULL pointer dereference (NPD) in
dsa_slave_xmit() to start with.

net/dsa/tag_8021q.c only comprises a set of helper functions at
the moment, and is not a fully autonomous or functional tagging "driver"
(though it could become one later on). We do not have any users of
NET_DSA_TAG_8021Q, so now is a good time to make sure no issues are
encountered, by making this file strictly a placeholder for
helper functions.
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: update mon's self addr when node addr generated · 46cb01ee

Committed by Hoang Le on Nov 12, 2019

In commit 25b0b9c4 ("tipc: handle collisions of 32-bit node address
hash values"), the 32-bit node address is only generated after the
one-second trial period has expired. However, the self addr in struct
tipc_monitor is not updated when the node address is generated, so it
always keeps its initial value of zero. As a result, the sorting algorithm
that uses this value does not work as expected, and neither does the
neighbor monitoring framework.

In this commit, we add a fix to update the self addr when the 32-bit node
address is generated.

Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_flow_table: hardware offload support · c29f74e0

Committed by Pablo Neira Ayuso on Nov 12, 2019

This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:

* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be removed
  from hardware. This might be triggered by either packet path (via TCP
  RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
  the hardware, the software garbage collector can remove it from the
  software flowtable.

This patch supports:

* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
  at this stage.

This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_tables: add flowtable offload control plane · 8bb69f3b

Committed by Pablo Neira Ayuso on Nov 12, 2019

This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_flow_table: detach routing information from flow description · f1363e05

Committed by Pablo Neira Ayuso on Nov 12, 2019

This patch adds the infrastructure to support flow entry types.
The initial type is NF_FLOW_OFFLOAD_ROUTE that stores the routing
information into the flow entry to define a fastpath for the classic
forwarding path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_flowtable: remove flow_offload_entry structure · 62248df8

Committed by Pablo Neira Ayuso on Nov 12, 2019

Move rcu_head to struct flow_offload, then remove the flow_offload_entry
structure definition.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_flow_table: move conntrack object to struct flow_offload · b32d2f34

Committed by Pablo Neira Ayuso on Nov 12, 2019

Simplify this code by storing the pointer to conntrack object in the
flow_offload structure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: actions: remove unused 'order' · e0e2b35b

Committed by Davide Caratti on Nov 12, 2019

After commit 4097e9d2 ("net: sched: don't use tc_action->order during
action dump"), 'act->order' is initialized but never read afterwards, so
we can just remove this member of struct tc_action.

CC: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Allow large formatted message of binary output · e2cde864

Committed by Aya Levin on Nov 12, 2019

Devlink supports pair output of name and value. When the value is
binary, it must be presented in an array. If the length of the binary
value exceeds fmsg limitation, break the value into chunks internally.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
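
Breaking an oversized binary value into chunks, as the devlink fmsg commit above describes, can be sketched as follows. The callback type and both functions are hypothetical; the real fmsg size limit and API are not shown here.

```c
#include <stddef.h>

/* Hypothetical sink that receives one chunk at a time. */
typedef void (*demo_emit_fn)(const unsigned char *chunk, size_t len,
                             void *ctx);

/* Emit 'data' in pieces of at most 'max_chunk' bytes.
 * Returns the number of chunks emitted. */
size_t demo_put_binary(const unsigned char *data, size_t len,
                       size_t max_chunk, demo_emit_fn emit, void *ctx)
{
    size_t chunks = 0;

    if (max_chunk == 0)
        return 0; /* nothing sensible to do with a zero limit */

    for (size_t off = 0; off < len; off += max_chunk) {
        size_t n = len - off < max_chunk ? len - off : max_chunk;

        emit(data + off, n, ctx);
        chunks++;
    }
    return chunks;
}

/* Example sink: adds each chunk length to a size_t total behind 'ctx'. */
void demo_count_bytes(const unsigned char *chunk, size_t len, void *ctx)
{
    (void)chunk;
    *(size_t *)ctx += len;
}
```

A 10-byte value with a 4-byte limit is emitted as three chunks of 4, 4 and 2 bytes, and the caller never has to know the limit exists.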

12 Nov, 2019 · 5 commits

tipc: fix update of the uninitialized variable err · c33fdc34

Committed by Colin Ian King on Nov 11, 2019

Variable err is not initialized and hence can potentially contain
any garbage value.  This may cause an error when OR-ing in the
return values from the calls to functions crypto_aead_setauthsize or
crypto_aead_setkey.  Fix this by setting err to the return of
crypto_aead_setauthsize rather than OR-ing the return into the
uninitialized variable.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: fc1b6d6d ("tipc: introduce TIPC encryption & authentication")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
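
The fixed pattern from the tipc commit above, in a self-contained form. The function and its parameters are hypothetical: the two return codes are passed in directly rather than obtained from crypto_aead_setauthsize() and crypto_aead_setkey().

```c
/* Combine the results of two setup steps.
 * The first result is assigned, which initializes 'err'; only the second
 * result is OR-ed in. OR-ing into 'err' before any assignment would read
 * an indeterminate value. */
int demo_setup(int setauthsize_rc, int setkey_rc)
{
    int err;

    err = setauthsize_rc;  /* plain assignment, not err |= ... */
    err |= setkey_rc;      /* safe now that err holds a defined value */
    return err;
}
```

The result is zero only when both steps returned zero, which is what the caller needs in order to decide whether setup succeeded.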

lwtunnel: ignore any TUNNEL_OPTIONS_PRESENT flags set by users · 0c06d166

Committed by Xin Long on Nov 10, 2019

TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) flags should be set only according to
tb[LWTUNNEL_IP_OPTS], which is done in ip_tun_parse_opts().

When setting info key.tun_flags, the TUNNEL_OPTIONS_PRESENT
bits in tb[LWTUNNEL_IP(6)_FLAGS] passed from users should
be ignored.

While at it, replace all (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) with 'TUNNEL_OPTIONS_PRESENT'.

Fixes: 3093fbe7 ("route: Per route IP tunnel metadata via lightweight tunnel")
Fixes: 32a2b002 ("ipv6: route: per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
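
Masking out user-supplied option bits, as the lwtunnel commit above describes, might be sketched like this. The DEMO_* bit values are illustrative assumptions and the function is hypothetical; in the kernel the option bits are set while parsing tb[LWTUNNEL_IP_OPTS].

```c
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions, not kernel ABI. */
#define DEMO_GENEVE_OPT 0x0800u
#define DEMO_VXLAN_OPT  0x1000u
#define DEMO_ERSPAN_OPT 0x4000u
#define DEMO_OPTIONS_PRESENT \
    (DEMO_GENEVE_OPT | DEMO_VXLAN_OPT | DEMO_ERSPAN_OPT)

/* Build the tunnel flags: drop any "options present" bits the user passed
 * in, then add only the bits derived from the parsed options. */
uint16_t demo_build_tun_flags(uint16_t user_flags, uint16_t parsed_opt_flags)
{
    uint16_t flags = user_flags & (uint16_t)~DEMO_OPTIONS_PRESENT;

    return (uint16_t)(flags | (parsed_opt_flags & DEMO_OPTIONS_PRESENT));
}
```

A user who sets DEMO_GENEVE_OPT without supplying any options therefore ends up with that bit cleared, so the flags cannot claim options that are not actually there.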

lwtunnel: get nlsize for erspan options properly · 58e8494e

Committed by Xin Long on Nov 10, 2019

erspan v1 has OPT_ERSPAN_INDEX while erspan v2 has OPT_ERSPAN_DIR and
OPT_ERSPAN_HWID attributes, and they require different nlsize when
dumping.

So this patch is to get nlsize for erspan options properly according
to erspan version.

Fixes: b0a21810 ("lwtunnel: add options setting and dumping for erspan")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lwtunnel: change to use nla_parse_nested on new options · ed02551f

Committed by Xin Long on Nov 10, 2019

As these options are newly added to the kernel, they should all use strict
parsing from the beginning with nla_parse_nested(), instead of
nla_parse_nested_deprecated().

Fixes: b0a21810 ("lwtunnel: add options setting and dumping for erspan")
Fixes: edf31cbb ("lwtunnel: add options setting and dumping for vxlan")
Fixes: 4ece4778 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add new "enable_roce" generic device param · 6c7295e1

Committed by Michael Guralnik on Nov 08, 2019

New device parameter to enable/disable handling of RoCE traffic in the
device.
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

09 Nov, 2019 · 7 commits

sctp: add SCTP_PEER_ADDR_THLDS_V2 sockopt · d467ac0a

Committed by Xin Long on Nov 08, 2019

Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld',
added to allow a user to change ps_retrans per sock/asoc/transport, like
the other two paddrthlds: pf_retrans and pathmaxrxt.

Note: to avoid breaking user programs, pf_retrans dump and setting are
supported by adding a new sockopt SCTP_PEER_ADDR_THLDS_V2 and a new
structure sctp_paddrthlds_v2, instead of extending sctp_paddrthlds.

Also, when setting ps_retrans, the value is not allowed to be greater
than pf_retrans.

v1->v2:
  - use SCTP_PEER_ADDR_THLDS_V2 to set/get pf_retrans instead,
    as Marcelo and David Laight suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add support for Primary Path Switchover · 34515e94

Committed by Xin Long on Nov 08, 2019

This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:

  Primary.Switchover.Max.Retrans (PSMR)

The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".

This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.

The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.

Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.

v3->v4:
  - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
    sysctl 'ps_retrans'.
  - add a new entry for ps_retrans on ip-sysctl.txt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
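
A simplified model of the Primary Path Switchover check from the commit above. Only the 0xffff "disabled" value and the rule that ps_retrans must not be less than pf_retrans come from the commit message; the function names and the exact form of the comparison are assumptions, and the real check in sctp_do_8_2_transport_strike() involves more state.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PS_RETRANS_MAX 0xffffu /* feature disabled, as in the commit */

/* Decide whether the primary path should move to another active path:
 * the path error counter must exceed ps_retrans (PSMR), and the feature
 * must not be disabled. */
bool demo_should_switch_primary(uint32_t error_count, uint32_t ps_retrans)
{
    if (ps_retrans == DEMO_PS_RETRANS_MAX)
        return false;
    return error_count > ps_retrans;
}

/* Validate a new ps_retrans value: it may not be lower than pf_retrans
 * and may not exceed the maximum. */
bool demo_ps_retrans_valid(uint32_t ps_retrans, uint32_t pf_retrans)
{
    return ps_retrans >= pf_retrans && ps_retrans <= DEMO_PS_RETRANS_MAX;
}
```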

sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt · 8d2a6935

Committed by Xin Long on Nov 08, 2019

This is a sockopt defined in section 7.3 of rfc7829: "Exposing
the Potentially Failed Path State", by which users can change
pf_expose per sock and asoc.

The new sockopt SCTP_EXPOSE_POTENTIALLY_FAILED_STATE is also
known as SCTP_EXPOSE_PF_STATE for short.

v2->v3:
  - return -EINVAL if params.assoc_value > SCTP_PF_EXPOSE_MAX.
  - define SCTP_EXPOSE_PF_STATE SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.
v3->v4:
  - improve changelog.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add SCTP_ADDR_POTENTIALLY_FAILED notification · 768e1518

Committed by Xin Long on Nov 08, 2019

SCTP Quick failover draft section 5.1, point 5 has been removed
from rfc7829. Instead, "the sender SHOULD (i) notify the Upper
Layer Protocol (ULP) about this state transition", as said in
section 3.2, point 8.

So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined
in section 7.1, "which is reported if the affected address
becomes PF". Also remove transport cwnd's update when moving
from PF back to ACTIVE, which is no longer in rfc7829 either.

Note that ulp_notify will be set to false if asoc->expose is
not 'enabled', according to last patch.

v2->v3:
  - define SCTP_ADDR_PF SCTP_ADDR_POTENTIALLY_FAILED.
v3->v4:
  - initialize spc_state with SCTP_ADDR_AVAILABLE, as Marcelo suggested.
  - check asoc->pf_expose in sctp_assoc_control_transport(), as Marcelo
    suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add pf_expose per netns and sock and asoc · aef587be

Committed by Xin Long on Nov 08, 2019

As said in rfc7829, section 3, point 12:

  The SCTP stack SHOULD expose the PF state of its destination
  addresses to the ULP as well as provide the means to notify the
  ULP of state transitions of its destination addresses from
  active to PF, and vice versa.  However, it is recommended that
  an SCTP stack implementing SCTP-PF also allows for the ULP to be
  kept ignorant of the PF state of its destinations and the
  associated state transitions, thus allowing for retention of the
  simpler state transition model of [RFC4960] in the ULP.

It not only allows the PF state to be exposed to the ULP, but also
allows sctp-pf to be hidden from the ULP.

So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.

It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.

Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.

v1->v2:
  - Fix a build warning noticed by Nathan Chancellor.
v2->v3:
  - set pf_expose to UNUSED by default to keep compatible with old
    applications.
v3->v4:
  - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
  - change this patch to 1/5, and move sctp_assoc_control_transport
    change into 2/5, as Marcelo suggested.
  - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
    set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: disallow reload operation during device cleanup · a0c76345

Committed by Jiri Pirko on Nov 08, 2019

There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could be easily obtained by running:

while true; do
        echo 10 > /sys/bus/netdevsim/new_device
        devlink dev reload netdevsim/netdevsim10 &
        echo 10 > /sys/bus/netdevsim/del_device
done

Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bb ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

packet: fix data-race in fanout_flow_is_huge() · b756ad92

Committed by Eric Dumazet on Nov 08, 2019

KCSAN reported the following data-race [1]

Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.

Since the report hinted about multiple cpus using the history
concurrently, I added a test avoiding writing on it if the
victim slot already contains the desired value.

[1]

BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover

read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
 fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
 fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
 deliver_skb net/core/dev.c:1888 [inline]
 dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
 xmit_one net/core/dev.c:3195 [inline]
 dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
 fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
 fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
 deliver_skb net/core/dev.c:1888 [inline]
 dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
 xmit_one net/core/dev.c:3195 [inline]
 dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 3b3a5b0a ("packet: rollover huge flows before small flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
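
The approach in the packet fanout commit above (marked accesses plus skipping the write when the slot already holds the value) can be modeled in userspace. This sketch uses C11 relaxed atomics as a stand-in for READ_ONCE()/WRITE_ONCE(); the history length, the struct, and passing the victim slot as a parameter are assumptions made to keep the example deterministic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HISTORY_LEN 16 /* hypothetical history size */

struct demo_rollover {
    _Atomic uint32_t history[DEMO_HISTORY_LEN];
};

void demo_rollover_init(struct demo_rollover *r)
{
    for (int i = 0; i < DEMO_HISTORY_LEN; i++)
        atomic_init(&r->history[i], 0);
}

/* Count how often 'rxhash' appears in the shared history, then record it
 * in the victim slot. The store is skipped when the slot already holds
 * the value, which avoids needless writes to a shared cache line. */
bool demo_flow_is_huge(struct demo_rollover *r, uint32_t rxhash,
                       uint32_t victim)
{
    int count = 0;

    for (int i = 0; i < DEMO_HISTORY_LEN; i++)
        if (atomic_load_explicit(&r->history[i],
                                 memory_order_relaxed) == rxhash)
            count++;

    victim %= DEMO_HISTORY_LEN;
    if (atomic_load_explicit(&r->history[victim],
                             memory_order_relaxed) != rxhash)
        atomic_store_explicit(&r->history[victim], rxhash,
                              memory_order_relaxed);

    return count > (DEMO_HISTORY_LEN >> 1);
}
```

In this sketch a flow counts as huge once its hash occupies more than half of the history slots.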

openeuler / Kernel · synced successfully more than 1 year ago