提交 · 513334e18a74f70c0be58c2eb73af1715325b870 · openeuler / Kernel

03 7月, 2016 17 次提交

由 David S. Miller 提交于 7月 02, 2016

Saeed Mahameed says:

====================
Mellanox 100G SRIOV E-Switch offload and VF representors

We are happy to announce SRIOV E-Switch offload and VF netdev representors.

Or Gerlitz says:

Currently, the way SR-IOV embedded switches are dealt with in Linux is limited
in its expressiveness and flexibility, but this is not necessarily due to
hardware limitations. The kernel software model for controlling the SR-IOV
switch simply does not allow the configuration of anything more complex than
MAC/VLAN based forwarding.

Hence the benefits brought by SRIOV come at a price of management flexibility,
when compared to software virtual switches which are used in Para-Virtual (PV)
schemes and allow implementing complex policies and virtual topologies. Such
SW switching typically involved a complex per-packet processing within the host
kernel using subsystems such as TC, Bridge, Netfilter and Open-vswitch.

We'd like to change that and get the best of both worlds: the performance of SR-IOV
with the management flexibility of software switches. This will eventually include
a richer model for controlling the SR-IOV switch for flow-based switching and
tunneling. Under this model, the e-switch is configured dynamically and a fallback
to software exists in case the hardware is unable to offload all required flows.

This series from Hadar Hen-Zion and myself, is the 1st step in that direction,
specfically, it provides full control on the SRIOV embedded switching by host
software and paves the way to offload switching rules and polices with downstream
patches.

To allow for host based SW control on the SRIOV HW switch, we introduce per VF
representor host netdevice. The VF representor plays the same role as TAP devices
in PV setup. A packet send through the VF representor on the host arrives to
the VF, and a packet sent through the VF is received by its representor. The
administrator can hook the representor netdev into a kernel switching component.
Once they do that, packets from the VF are subject to steering (matching and
actions) of that software component."

Doing so indeed hurts the performance benefits of SRIOV as it forces all the
traffic to go through the hypervisor. However, this SW representation is what
would eventually allow us to introduce hybrid model, where we offload steering
for some of the VF/VM traffic to the HW while keeping other VM traffic to go
through the hypervisor. Examples for the latter are first packet of flows which
are needed for SW switches learning and/or matching against policy database or
types of traffic for which offloading is not desired or not supported by the
current HW eswitch generation.

The embedded switch is managed through a PCI device driver. As such, we introduce
a devlink/pci based scheme for setting the mode of the e-switch. The current mode
(where steering is done based on mac/vlan, etc) is referred to as "legacy" and the
new mode as "offloads".

For the mlx5 driver / ConnectX4 HW case, the VF representors implement a functional
subset of mlx5e Ethernet netdevices using their own profile. This design buys us robust
implementation with code reuse and sharing.

The representors are created by the host PCI driver when (1) in SRIOV and (2) the
e-switch is set to offloads mode. Currently, in mlx5 the e-switch management is done
through the PF vport (0) and hence the VF representors along with the existing PF
netdev which represents the uplink share the PCI PF device instance.

The series is built from two major components, the first relates to the e-switch
management and the second to VF representors.

We start with a refactoring that treats the existing SRIOV e-switch code as of operating
in legacy mode. Next, we add the code for the offloads mode which programs the e-switch
to operate in a way which serves for software based switching:

1. miss rule which matches all packets that do not match any HW other switching rule
and forwards them to the e-switch management port (0) for further processing.

2. infrastructure for send-to-vport rules which conceptually bypass other "normal"
steering rules which present at the e-switch datapath. Such rules apply only for packets
that originate in the e-switch manager vport (0).

Since all the VF reps run over the same e-switch port, we use more logic in the host PCI
driver to do HW steering of missed packets into the HW queue opened by a the respective VF
representor. Finally here, we add the devlink APIs to configure the e-switch mode.

The second part from Hadar starts with some refactoring work which allow for multiple
mlx5e NIC instances to be created over the same PCI function, use common resources
and avoid wrong loopbacks.

Next comes the heart of the change which is a profile definition which allow to practically
have both "conventional" mlx5e NIC use cases such as native mode (non SRIOV), VF, PF and VF
representor to share the Ethernet driver code. This is done by a small surgery that ended up
with few internal callbacks that should be implemented by a profile instance. The profile
for the conventional NIC is implemented, to preserve the existing functionality.

The last two patches add e-switch registration API for the VF representors and the
implementation of the VF representors netdevice profile. Being an mlx5e instance, the
VF representor uses HW send/recv queues, completions queues and such. It currently doesn't
support NIC offloads but some of them could be added later on. The VF representor has
switchdev ops, where currently the only supported API is the one to the HW ID,
which is needed to identify multiple representors belonging to the same e-switch.

The architecture + solution (software and firmware) work were done by a team consisting
of Ilya Lesokhin, Haggai Eran, Rony Efraim, Tal Anker, Natan Oppenheimer, Saeed Mahameed,
Hadar and Or, thanks you all!

v1 --> v2 fixes:
* removed unneeded variable (patch #3)
* removed unused value DEVLINK_ESWITCH_MODE_NONE (patch #8)
* changed the devlink mode name from "offloads" to "switchdev" which
better describes what are we referring here, using a known concept (patch #8)
* correctly refer to devlink e-switch modes (patch #10)
* use the correct mlx5e way to define the VF rep statistics (patch #16)

v2 --> v3 fixes:
* Rebased on top 6fde0e63 'be2net: signedness bug in be_msix_enable()'
* Handled compilation error introduced by rebase on top "f5074d0c Merge branch 'mlx5-100G-fixes'"
* This series applies perfectly even with 'mlx5 resiliency and xmit path fixes' merged to net-next
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

513334e1

net/mlx5e: Introduce SRIOV VF representors · cb67b832

由 Hadar Hen Zion 提交于 7月 01, 2016

Implement the relevant profile functions to create mlx5e driver instance
serving as VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.

To do that, we also export set of shared service functions from en_main.c,
such that they can be used by both NIC and repsresentors netdevs.

The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of it's
own for phys port name.

The profiling infrastructure allow sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.

The VF reps and the PF which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supposed is attr get for the
port parent ID which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g
using tc, bridge or ovs.

Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
to user-space the VF vport number and along with the switchdev port parent
id (phys_switch_id) enable a udev base consistent naming scheme:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"

where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cb67b832

net/mlx5: Add Representors registration API · 127ea380

由 Hadar Hen Zion 提交于 7月 01, 2016

Introduce E-Switch registration/unregister representors functions.

Those functions are called by the mlx5e driver when the PF NIC is
created upon pci probe action regardless of the E-Switch mode (NONE,
LEGACY or OFFLOADS).

Adding basic E-Switch database that will hold the vport represntors
upon creation.

This patch doesn't add any new functionality.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

127ea380

net/mlx5e: Add support for multiple profiles · 6bfd390b

由 Hadar Hen Zion 提交于 7月 01, 2016

To allow support in representor netdevices where we create more than one
netdevice per NIC, add profiles to the mlx5e driver. The profiling
allows for creation of mlx5e instances with different characteristics.

Each profile implements its own behavior using set of function pointers
defined in struct mlx5e_profile. This is done to allow for avoiding complex
per profix branching in the code.

Currently only the profile for the conventional NIC is implemented,
which is of use when a netdev is created upon pci probe.

This patch doesn't add any new functionality.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6bfd390b

net/mlx5e: Mark enabled RQTs instances explicitly · 398f3351

由 Hadar Hen Zion 提交于 7月 01, 2016

In the current driver implementation two types of receive queue
tables (RQTs) are in use - direct and indirect.

Change the driver to mark each new created RQT (direct or indirect)
as "enabled". This behaviour is needed for introducing new mlx5e
instances which serve to represent SRIOV VFs.

The VF representors will have only one type of RQTs (direct).

An "enabled" flag is added to each RQT to allow better handling
and code sharing between the representors and the nic netdevices.

This patch doesn't add any new functionality.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

398f3351

net/mlx5e: TIRs management refactoring · 724b2aa1

由 Hadar Hen Zion 提交于 7月 01, 2016

The current refresh tirs self loopback mechanism, refreshes all the tirs
belonging to the same mlx5e instance to prevent self loopback by packets
sent over any ring of that instance. This mechanism relies on all the
tirs/tises of an instance to be created with the same transport domain
number (tdn).

Change the driver to refresh all the tirs created under the same tdn
regardless of which mlx5e netdev instance they belong to.

This behaviour is needed for introducing new mlx5e instances which serve
to represent SRIOV VFs. The representors and the PF share vport used for
E-Switch management, and we want to avoid NIC level HW loopback between
them, e.g when sending broadcast packets. To achieve that, both the
representors and the PF NIC will share the tdn.

This patch doesn't add any new functionality.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

724b2aa1

net/mlx5e: Create NIC global resources only once · b50d292b

由 Hadar Hen Zion 提交于 7月 01, 2016

To allow creating more than one netdev over the same PCI function, we
change the driver such that global NIC resources are created once and
later be shared amongst all the mlx5e netdevs running over that port.

Move the CQ UAR, PD (pdn), Transport Domain (tdn), MKey resources from
being kept in the mlx5e priv part to a new resources structure
(mlx5e_resources) placed under the mlx5_core device.

This patch doesn't add any new functionality.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b50d292b

net/mlx5e: Add devlink based SRIOV mode changes · c930a3ad

由 Or Gerlitz 提交于 7月 01, 2016

Implement handlers for the devlink commands to get and set the SRIOV
E-Switch mode.

When turning to the switchdev/offloads mode, we disable the e-switch
and enable it again in the new mode, create the NIC offloads table
and create VF reps.

When turning to legacy mode, we remove the VF reps and the offloads
table, and re-initiate the e-switch in it's legacy mode.

The actual creation/removal of the VF reps is done in downstream patches.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c930a3ad

net/mlx5: Add devlink interface · feae9087

由 Or Gerlitz 提交于 7月 01, 2016

The devlink interface is initially used to set/get the mode of the SRIOV e-switch.

Currently, these are only stubs for get/set, down-stream patch will actually
fill them out.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

feae9087

net/devlink: Add E-Switch mode control · 08f4b591

由 Or Gerlitz 提交于 7月 01, 2016

Add the commands to set and show the mode of SRIOV E-Switch, two modes
are supported:

* legacy: operating in the "old" L2 based mode (DMAC --> VF vport)

* switchdev: the E-Switch is referred to as whitebox switch configured
using standard tools such as tc, bridge, openvswitch etc. To allow
working with the tools, for each VF, a VF representor netdevice is
created by the E-Switch manager vendor device driver instance (e.g PF).
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

08f4b591

net/mlx5: E-Switch, Add API to create vport rx rules · fed9ce22

由 Or Gerlitz 提交于 7月 01, 2016

Add the API to create vport rx rules of the form

	packet meta-data :: vport == $VPORT --> $TIR

where the TIR is opened by this VF representor.

This logic will by used for packets that didn't match any rule in the
e-switch datapath and should be received into the host OS through the
netdevice that represents the VF they were sent from.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fed9ce22

net/mlx5: E-Switch, Add offloads table · c116c6ee

由 Or Gerlitz 提交于 7月 01, 2016

Belongs to the NIC offloads name-space, and to be used as part of the
SRIOV offloads logic to steer packets that hit the e-switch miss rule
to the TIR of the relevant VF representor.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c116c6ee

net/mlx5: Introduce offloads steering namespace · acbc2004

由 Or Gerlitz 提交于 7月 01, 2016

Add a new namespace (MLX5_FLOW_NAMESPACE_OFFLOADS) to be populated
with flow steering rules that deal with rules that have have to
be executed before the EN NIC steering rules are matched.

The namespace is located after the bypass name-space and before the
kernel name-space. Therefore, it precedes the HW processing done for
rules set for the kernel NIC name-space.

Under SRIOV, it would allow us to match on e-switch missed packet
and forward them to the relevant VF representor TIR.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

acbc2004

net/mlx5: E-Switch, Add API to create send-to-vport rules · ab22be9b

由 Or Gerlitz 提交于 7月 01, 2016

Add the API to create send-to-vport e-switch rules of the form

packet meta-data :: send-queue-number == $SQN and source-vport == 0 --> $VPORT

These rules are to be used for a send-to-vport logic which conceptually bypasses
the "normal" steering rules currently present at the e-switch datapath.

Such rule should apply only for packets that originate in the e-switch manager
vport (0) and are sent for a given SQN which is used by a given VF representor
device, and hence the matching logic.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab22be9b

net/mlx5: E-Switch, Add miss rule for offloads mode · 3aa33572

由 Or Gerlitz 提交于 7月 01, 2016

In the sriov offloads mode, packets that are not matched by any other
rule should be sent towards the e-switch manager for further processing.

Add such "miss" rule which matches ANY packet as the last rule in the
e-switch FDB and programs the HW to send the packet to vport 0 where
the e-switch manager runs.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3aa33572

net/mlx5: E-Switch, Add support for the sriov offloads mode · 69697b6e

由 Or Gerlitz 提交于 7月 01, 2016

Unlike the legacy mode, here, forwarding rules are not learned by the
driver per events on macs set by VFs/VMs into their vports, but rather
should be programmed by higher-level SW entities.

Saying that, still, in the offloads mode (SRIOV_OFFLOADS), two flow
groups are created by the driver for management (slow path) purposes:

The first group will be used for sending packets over e-switch vports
from the host OS where the e-switch management code runs, to be
received by VFs.

The second group will be used by a miss rule which forwards packets toward
the e-switch manager. Further logic will trap these packets such that
the receiving net-device as seen by the networking stack is the representor
of the vport that sent the packet over the e-switch data-path.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

69697b6e

net/mlx5: E-Switch, Add operational mode to the SRIOV e-Switch · 6ab36e35

由 Or Gerlitz 提交于 7月 01, 2016

Define three modes for the SRIOV e-switch operation, none (SRIOV_NONE,
none of the VF vports are enabled), legacy (SRIOV_LEGACY, the current mode)
and sriov offloads (SRIOV_OFFLOADS). Currently, when in SRIOV, only the
legacy mode is supported, where steering rules are of the form:

        destination mac --> VF vport

This patch does not change any functionality.
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ab36e35

02 7月, 2016 23 次提交

Merge tag 'batadv-next-for-davem-20160701' of git://git.open-mesh.org/linux-merge · 3ea00443

由 David S. Miller 提交于 7月 01, 2016

Simon Wunderlich says:

====================
This feature patchset includes the following changes:

 - two patches with minimal clean up work by Antonio Quartulli and
   Simon Wunderlich

 - eight patches of B.A.T.M.A.N. V, API and documentation clean
   up work, by Antonio Quartulli and Marek Lindner

 - Andrew Lunn fixed the skb priority adoption when forwarding
   fragmented packets (two patches)

 - Multicast optimization support is now enabled for bridges which
   comes with some protocol updates, by Linus Luessing
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3ea00443

Merge branch 'hns-next' · ca9354a1

由 David S. Miller 提交于 7月 01, 2016

Yisen Zhuang says:

====================
net: hns: fix the typo of hns

This series includes typo fixes which review by Andy, adding
the hns maintainer to MAINTAINERS, as below:

 > from Daode: adds the maintainer for hns driver;

 > from Daode: fix the typo of hns reviewed by Andy Shevchenko;

 > from Kejian: one remove redundant function and two fix to get
configuration from DT.

changlog:
 v2 -> v3:
  match all files in and below drivers/net/ethernet/hisilicon/

 v1 -> v2:
  fix the indentations reviewed by David.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca9354a1

net: hns: get reset registers from DT · b15dc292

由 Kejian Yan 提交于 7月 01, 2016

Since the registers of subctrl may be different, it is better to
mv the registers from hns mdio driver routine to device tree node.
Signed-off-by: NKejian Yan <yankejian@huawei.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b15dc292

net: hns: add media-type property for hns · 5d2525f7

由 Kejian Yan 提交于 7月 01, 2016

It is PORT_TP type if the service port is GE mode. It is wrong to
judge the port type by using if it is service port. Adding the media
type to know port type.
Reported-by: NJinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: NKejian Yan <yankejian@huawei.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d2525f7

net: hns: remove redundant hns_mac_dev_to_enet_if() · 2e14b218

由 Kejian Yan 提交于 7月 01, 2016

The sequence of hns_mac_dev_to_enet_if() is the same as
hns_get_enet_interface(), and hns_get_enet_interface() is called
by initialization to get the mac mode. And the mode is not changed
anywhere. Thus add hns_mac_dev_to_enet_if() function to get the mac
mode is obviously redundant.
Reported-by: NJinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: NKejian Yan <yankejian@huawei.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2e14b218

net: hns: normalize two different loop · 45fc764e

由 Daode Huang 提交于 7月 01, 2016

There are two approaches to assign data, one does 2 loops, another
does 1 loop. This patch normalize the different methods to 1 loop.
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45fc764e

net: hns: add a space before "*/" · 6ba312eb

由 Daode Huang 提交于 7月 01, 2016

In comment line, some time miss a space before */, so this
patch adds a space before */.
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ba312eb

net: hns: delete redundant parenthese · 68fa1636

由 Daode Huang 提交于 7月 01, 2016

According to the previous review comments from Andy, this patch
deletes the redundant parens in the patch.
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68fa1636

net: hns: change code style from a = a + x to a += x · 8ec98ba7

由 Daode Huang 提交于 7月 01, 2016

This patch fixes the code style in hns driver. Change it from
"buff = buff + xxx" to "buff += xxx". The reveiw comments is
from andy.
Reviewed-by: NAndriy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8ec98ba7

net: hns: fix code style about hns driver · d9fdb4ed

由 Daode Huang 提交于 7月 01, 2016

This patch fixes code sytle of hns driver to make it
simple.
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d9fdb4ed

MAINTAINERS: add maintainers for hns driver · b30d74e4

由 Daode Huang 提交于 7月 01, 2016

This patch adds maintainers for hisilicon network subsystem driver
Signed-off-by: NDaode Huang <huangdaode@hisilicon.com>
Signed-off-by: NYisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b30d74e4

Merge branch 'rds-multipath-datastructures' · 1364db42

由 David S. Miller 提交于 7月 01, 2016

Sowmini Varadhan says:

====================
RDS:TCP data structure changes for multipath support

The second installment of changes to enable multipath support in
RDS-TCP. This series implements the changes in rds-tcp so that the
rds_conn_path has a pointer to the rds_tcp_connection in cp_transport_data.
Struct rds_tcp_connection keeps track of the inet_sk per path in
t_sock. The ->sk_user_data in turn is a pointer to the rds_conn_path.
With this set of changes, rds_tcp has the needed plumbing to handle
multiple paths(socket) per rds_connection.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1364db42

RDS: Do not send a pong to an incoming ping with 0 src port · 11bb62f7

由 Sowmini Varadhan 提交于 6月 30, 2016

RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originators src port. However if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11bb62f7

RDS: TCP: Simplify reconnect to avoid duelling reconnnect attempts · 8315011a

由 Sowmini Varadhan 提交于 6月 30, 2016

When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8315011a

RDS: TCP: Hooks to set up a single connection path · b04e8554

由 Sowmini Varadhan 提交于 6月 30, 2016

This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b04e8554

RDS: TCP: make receive path use the rds_conn_path · 2da43c4a

由 Sowmini Varadhan 提交于 6月 30, 2016

The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2da43c4a

RDS: TCP: make ->sk_user_data point to a rds_conn_path · ea3b1ea5

由 Sowmini Varadhan 提交于 6月 30, 2016

The socket callbacks should all operate on a struct rds_conn_path,
in preparation for a MP capable RDS-TCP.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea3b1ea5

RDS: TCP: Refactor connection destruction to handle multiple paths · afb4164d

由 Sowmini Varadhan 提交于 6月 30, 2016

A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.

For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.

For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afb4164d

RDS: TCP: Make rds_tcp_connection track the rds_conn_path · 02105b2c

由 Sowmini Varadhan 提交于 6月 30, 2016

The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02105b2c

RDS: TCP: Remove dead logic around c_passive in rds-tcp · 26e4e6bb

由 Sowmini Varadhan 提交于 6月 30, 2016

The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26e4e6bb

RDS: Rework path specific indirections · 226f7a7d

由 Sowmini Varadhan 提交于 6月 30, 2016

Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.
Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

226f7a7d

Merge branch 'bpf-cgroup2' · dc9a2002

由 David S. Miller 提交于 7月 01, 2016

Martin KaFai Lau says:

====================
cgroup: bpf: cgroup2 membership test on skb

This series is to implement a bpf-way to
check the cgroup2 membership of a skb (sk_buff).

It is similar to the feature added in netfilter:
c38c4597 ("netfilter: implement xt_cgroup cgroup2 path match")

The current target is the tc-like usage.

v3:
- Remove WARN_ON_ONCE(!rcu_read_lock_held())
- Stop BPF_MAP_TYPE_CGROUP_ARRAY usage in patch 2/4
- Avoid mounting bpf fs manually in patch 4/4

- Thanks for Daniel's review and the above suggestions

- Check CONFIG_SOCK_CGROUP_DATA instead of CONFIG_CGROUPS.  Thanks to
  the kbuild bot's report.
  Patch 2/4 only needs CONFIG_CGROUPS while patch 3/4 needs
  CONFIG_SOCK_CGROUP_DATA.  Since a single bpf cgrp2 array alone is
  not useful for now, CONFIG_SOCK_CGROUP_DATA is also used in
  patch 2/4.  We can fine tune it later if we find other use cases
  for the cgrp2 array.
- Return EAGAIN instead of ENOENT if the cgrp2 array entry is
  NULL.  It is to distinguish these two cases: 1) the userland has
  not populated this array entry yet. or 2) not finding cgrp2 from the skb.

- Be-lated thanks to Alexei and Tejun on reviewing v1 and giving advice on
  this work.

v2:
- Fix two return cases in cgroup_get_from_fd()
- Fix compilation errors when CONFIG_CGROUPS is not used:
  - arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY
  - filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup
- Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd()
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dc9a2002

cgroup: bpf: Add an example to do cgroup checking in BPF · a3f74617

由 Martin KaFai Lau 提交于 6月 30, 2016

test_cgrp2_array_pin.c:
A userland program that creates a bpf_map (BPF_MAP_TYPE_GROUP_ARRAY),
pouplates/updates it with a cgroup2's backed fd and pins it to a
bpf-fs's file.  The pinned file can be loaded by tc and then used
by the bpf prog later.  This program can also update an existing pinned
array and it could be useful for debugging/testing purpose.

test_cgrp2_tc_kern.c:
A bpf prog which should be loaded by tc.  It is to demonstrate
the usage of bpf_skb_in_cgroup.

test_cgrp2_tc.sh:
A script that glues the test_cgrp2_array_pin.c and
test_cgrp2_tc_kern.c together.  The idea is like:
1. Load the test_cgrp2_tc_kern.o by tc
2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
   with a cgroup fd
3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
   dropped because of a match on the cgroup

Most of the lines in test_cgrp2_tc.sh is the boilerplate
to setup the cgroup/bpf-fs/net-devices/netns...etc.  It is
not bulletproof on errors but should work well enough and
give enough debug info if things did not go well.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3f74617

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功