Commits · 084e2f6566d2a39c007ed6473f58b551a2eeefeb · openanolis / cloud-kernel

02 Mar, 2016 · 40 commits

Support to encoding decoding skb mark on IFE action · 084e2f65

Committed by Jamal Hadi Salim on Feb 27, 2016

Example usage:
Set the skb mark using skbedit, then allow it to be encoded

sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit mark 17 \
action ife encode \
allow mark \
dst 02:15:15:15:15:15

Note: You don't need the skbedit action if you are already encoding the
skb mark earlier. A zero skb mark, when seen, will not be encoded.

Alternatively, hard-code a static mark of 0x12 every time the filter matches:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use mark 0x12 \
dst 02:15:15:15:15:15
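
The two variants above differ only in where the encoded value comes from. A minimal Python sketch of that choice follows; `mark_to_encode` is a hypothetical helper written for illustration, not a kernel or tc function, and it models only the behaviour stated in this message.

```python
def mark_to_encode(mode, skb_mark, static_mark=None):
    """Return the mark value to send, or None to send no mark.

    mode is "allow" (pass the packet's own skb mark through) or
    "use" (send the configured static_mark on every match).
    Hypothetical helper for illustration only.
    """
    if mode == "use":
        # "use mark 0x12": the static value is sent whenever the filter matches.
        return static_mark
    if mode == "allow":
        # "allow mark": a zero skb mark, when seen, is not encoded.
        return skb_mark if skb_mark != 0 else None
    raise ValueError("unknown mode: %r" % (mode,))
```

With `allow`, a packet that was never marked produces no mark metadata, which is why the first example sets the mark with skbedit beforehand.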
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

introduce IFE action · ef6980b6

Committed by Jamal Hadi Salim on Feb 27, 2016

This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesn't have to specify it.
For now we require the user to specify it.

Let's show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.

YYYY: Let's start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress

xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo tc filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok

xxx: Let's show the decoding policy
sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 0 success 0)
  match 00000000/00000000 at 0 (success 0 )
        action order 1: ife decode action reclassify
         index 1 ref 1 bind 1 installed 14 sec used 14 sec
         type: 0x0
         Metadata: allow mark allow hash allow prio allow qmap
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
xxx:
Observe that the above lists all the metadata it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules.

YYYY: Let's show the sender side now ..

xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx:	ethertype 0xdead
xxx:	add skb->mark to whitelist of metadatum to send
xxx:	rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo tc filter add dev $ETH parent 1: protocol ip prio 10  u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15

xxx: Let's show the encoding policy
sudo tc -s filter ls dev $ETH parent 1: protocol ip
xxx:
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 0 success 0)
  match c0a87aed/ffffffff at 16 (success 0 )
  match 00010000/00ff0000 at 8 (success 0 )

	action order 1:  skbedit mark 17
	 index 6 ref 1 bind 1
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 2: ife encode action pipe
	 index 3 ref 1 bind 1
	 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
 	 Metadata: allow mark
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0
xxx:

test by sending ping from sender to destination
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-davem-2016-02-26' of... · d67703fc

Committed by David S. Miller on Mar 01, 2016

Merge tag 'mac80211-next-for-davem-2016-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Here's another round of updates for -next:
 * big A-MSDU RX performance improvement (avoid linearize of paged RX)
 * rfkill changes: cleanups, documentation, platform properties
 * basic PBSS support in cfg80211
 * MU-MIMO action frame processing support
 * BlockAck reordering & duplicate detection offload support
 * various cleanups & little fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-mcast-tmp-router-port' · 4ec62070

Committed by David S. Miller on Mar 01, 2016

Nikolay Aleksandrov says:

====================
bridge: mcast: add support for temp router port

This set adds support for temporary router port which doesn't depend only
on the incoming queries. It can be refreshed by setting multicast_router to
the same value (3). The first two patches are minor changes that prepare
the code for the third which adds this new type of router port.
In order to be able to dump its information the mdb router port format
is changed in patch 04 and extended similar to how mdb entries format was
done recently.
The related iproute2 changes will be posted if this is accepted.

v2: set val first and adjust router type later in patch 01, patch 03 was
split in 2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: mcast: add support for more router port information dumping · 59f78f9f

Committed by Nikolay Aleksandrov on Feb 26, 2016

Allow for more multicast router port information to be dumped such as
timer and type attributes. For that purpose we need to extend the
MDBA_ROUTER_PORT attribute similar to how it was done for the mdb entries
recently. The new format is thus:
[MDBA_ROUTER_PORT] = { <- nested attribute
    u32 ifindex <- router port ifindex for user-space compatibility
    [MDBA_ROUTER_PATTR attributes]
}
This way it remains compatible with older users (they'll simply retrieve
the u32 in the beginning) and new users can parse the remaining
attributes. It also allows adding future extensions to the router
port without breaking compatibility.
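
The compatibility argument can be illustrated with a small Python sketch that builds and parses such a payload: a leading u32 ifindex followed by standard netlink attributes (a 4-byte `nla_len`/`nla_type` header, payload padded to 4 bytes). The attribute type numbers and payload widths used below (1 for a u32 timer, 2 for a u8 type) are assumptions made for the example, not values taken from this commit.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in host byte order

def nla_pack(nla_type, payload):
    """Pack one netlink attribute, padding it to a 4-byte boundary."""
    length = NLA_HDR.size + len(payload)
    pad = (-length) % 4
    return NLA_HDR.pack(length, nla_type) + payload + b"\0" * pad

def parse_router_port(payload):
    """Return (ifindex, {attr_type: raw_bytes}) from a router port payload."""
    # Old users stop here: the first 4 bytes are still the u32 ifindex.
    (ifindex,) = struct.unpack_from("=I", payload, 0)
    attrs = {}
    off = 4
    # New users keep walking the attributes that follow the ifindex.
    while off + NLA_HDR.size <= len(payload):
        nla_len, nla_type = NLA_HDR.unpack_from(payload, off)
        if nla_len < NLA_HDR.size or off + nla_len > len(payload):
            break  # malformed attribute, stop parsing
        attrs[nla_type] = payload[off + NLA_HDR.size:off + nla_len]
        off += (nla_len + 3) & ~3  # advance to the next aligned attribute
    return ifindex, attrs

# Example payload: ifindex 7, assumed timer attribute (type 1, u32 250),
# assumed router type attribute (type 2, u8 3).
example = (struct.pack("=I", 7)
           + nla_pack(1, struct.pack("=I", 250))
           + nla_pack(2, struct.pack("=B", 3)))
```

An older parser that reads only `struct.unpack_from("=I", example)` still gets the ifindex, while `parse_router_port(example)` also returns the trailing attributes.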
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: mcast: add support for temporary port router · a55d8246

Committed by Nikolay Aleksandrov on Feb 26, 2016

Add support for a temporary router port which doesn't depend only on the
incoming query. It can be refreshed if set to the same value, which is
a no-op for the rest.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: mcast: do nothing if port's multicast_router is set to the same val · 4950cfd1

Committed by Nikolay Aleksandrov on Feb 26, 2016

This is needed for the upcoming temporary port router. There's no point
in going through the logic if the value is the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: mcast: use names for the different multicast_router types · 7f0aec7a

Committed by Nikolay Aleksandrov on Feb 26, 2016

Using raw values makes the code difficult to extend and to understand.
Give them names and do explicit per-option manipulation in
br_multicast_set_port_router.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6xxx-vlan-filtering' · ec1606c0

Committed by David S. Miller on Mar 01, 2016

Vivien Didelot says:

====================
net: dsa: mv88e6xxx: implement VLAN filtering

This patchset fixes hardware bridging for non 802.1Q aware systems.

The mv88e6xxx DSA driver currently depends on CONFIG_VLAN_8021Q and
CONFIG_BRIDGE_VLAN_FILTERING enabled for correct bridging between switch ports.

Patch 1/9 adds support for the VLAN filtering switchdev attribute in DSA.

Patches 2/9 and 3/9 add helper functions for the following patches.

Patches 4/9 to 6/9 assign dynamic address databases to VLANs, ports, and
bridge groups (the lowest available FID is cleared and assigned), and thus
restore support for per-port FDB operations.

Patches 7/9 to 9/9 refine port isolation and set up 802.1Q on user demand.

With this patchset, ports get correctly bridged and the driver behaves as
expected, with or without 802.1Q support.

With CONFIG_VLAN_8021Q enabled, setting a default PVID to the bridge correctly
propagates the corresponding VLAN, in addition to the hardware bridging:

    # echo 42 > /sys/class/net/<bridge>/bridge/default_pvid

With CONFIG_BRIDGE_VLAN_FILTERING enabled, however, hardware VLAN
filtering is enabled on all bridge members only when the user requests it:

    # echo 1 > /sys/class/net/<bridge>/bridge/vlan_filtering
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: support VLAN filtering · 214cdb99

Committed by Vivien Didelot on Feb 26, 2016

Implement port_vlan_filtering in the driver to toggle the related port
802.1Q mode between DISABLED and SECURE, on user request.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: remove reserved VLANs · 46fbe5e5

Committed by Vivien Didelot on Feb 26, 2016

Now that port isolation is correctly configured when joining or leaving
a bridge, there is no need to rely on reserved VLANs to isolate
unbridged ports anymore. Thus remove them, and disable 802.1Q on setup.

This restores the expected behavior of hardware bridging for systems
without 802.1Q or VLAN filtering enabled.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: restore VLANTable map control · b7666efe

Committed by Vivien Didelot on Feb 26, 2016

The In Chip Port Based VLAN Table contains bits used to restrict which
output ports this input port can send frames to.

With the VLAN filtering enabled, these tables work in conjunction with
the VLAN Table Unit to allow egressing frames.

In order to remove the current dependency on BRIDGE_VLAN_FILTERING for
basic hardware bridging to work, it is necessary to restore fine-grained
control of each port's VLANTable, on setup and when a port joins or
leaves a bridge.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: assign dynamic FDB to bridges · 466dfa07

Committed by Vivien Didelot on Feb 26, 2016

Give a new bridge a fresh FDB, assign it to its members, and restore a
fresh FDB to a port leaving a bridge.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: assign default FDB to ports · 2db9ce1f

Committed by Vivien Didelot on Feb 26, 2016

Restore per-port FDB. Assign them on setup, allow adding and deleting
addresses into them, and dump them.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: assign dynamic FDB to VLANs · 3285f9e8

Committed by Vivien Didelot on Feb 26, 2016

Add a _mv88e6xxx_fid_new function which gives and flushes the lowest FID
available. Call it when preparing a new VTU entry.
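
A minimal Python sketch of "give and flush the lowest available FID" follows. It is not the driver's code: `fid_new`, the `flush` callback, the `max_fid` bound, and the choice to record the FID as used inside the helper are all assumptions made for illustration.

```python
def fid_new(used, flush, max_fid=4095):
    """Return the lowest FID not in `used`, flushing it before handing it out.

    used    -- set of FIDs currently assigned (updated in place)
    flush   -- callable that clears the address database of a FID
    max_fid -- hypothetical upper bound on FID numbers
    """
    for fid in range(1, max_fid + 1):
        if fid not in used:
            flush(fid)      # start from an empty address database
            used.add(fid)   # assumption: the helper also records the claim
            return fid
    raise RuntimeError("no free FID available")
```

For example, with FIDs 1, 2 and 4 in use, the helper flushes and returns FID 3.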
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: extract single FDB dump · 74b6ba0d

Committed by Vivien Didelot on Feb 26, 2016

Move out the code which dumps a single FDB to its own function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: extract single VLAN retrieval · 2fb5ef09

Committed by Vivien Didelot on Feb 26, 2016

Rename _mv88e6xxx_vlan_init to _mv88e6xxx_vtu_new, eventually called
from a new _mv88e6xxx_vtu_get function, which abstracts the VTU GetNext
VID-1 trick to retrieve a single entry.
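
The "GetNext VID-1 trick" can be sketched in Python under the assumption that GetNext returns the first valid entry whose VID is strictly greater than the one supplied. The table here is just a sorted list of VIDs, and `vtu_getnext` and `vtu_get` are illustrative stand-ins rather than the driver functions.

```python
import bisect

def vtu_getnext(sorted_vids, vid):
    """Return the first stored VID strictly greater than `vid`, or None."""
    i = bisect.bisect_right(sorted_vids, vid)
    return sorted_vids[i] if i < len(sorted_vids) else None

def vtu_get(sorted_vids, vid):
    """Fetch a single entry: ask for the entry after VID-1 and check it."""
    found = vtu_getnext(sorted_vids, vid - 1)
    # If the next entry is not the requested VID, that VID is not in the table.
    return found if found == vid else None
```

With entries for VIDs 1, 42 and 100, `vtu_get(vids, 42)` finds the entry, while `vtu_get(vids, 43)` lands on VID 100 and therefore reports nothing.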
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: support VLAN filtering switchdev attr · fb2dabad

Committed by Vivien Didelot on Feb 26, 2016

When a user explicitly requests VLAN filtering with something like:

    # echo 1 > /sys/class/net/<bridge>/bridge/vlan_filtering

Switchdev propagates a SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING port
attribute.

Add support for it in the DSA layer with a new port_vlan_filtering
function to let drivers toggle 802.1Q filtering on user demand.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'devlink' · 7f66ee41

Committed by David S. Miller on Mar 01, 2016

Jiri Pirko says:

====================
Introduce devlink interface and first drivers to use it

There is a need for some userspace API that would allow exposing things
that are not directly related to any device class like net_device or
ib_device, but rather chip-wide/switch-ASIC-wide stuff.

Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) setting up port splitters - split port into multiple ones and squash again,
   enables usage of splitter cable
3) setting up shared buffers - shared among multiple ports within
   one chip (work in progress)
4) configuration of switch wide properties - resources division etc - This will
   allow to pass configuration that is unacceptable to be passed as
   a module option.

First patch of this set introduces a new generic Netlink based interface,
called "devlink". It is similar to nl80211 model and it is heavily
influenced by it, including the API definition. The devlink introduction patch
implements use cases 1) and 2). Other 2 are in development atm and will
be addressed by follow-ups.

It is very convenient for drivers to use devlink, as you can see in other
patches in this set.

The counterpart for devlink is a userspace tool, for now called "dl". Its
command line interface and outputs are derived from the "ip" tool, so it
should be easy for users to get used to it.

It is available here as a standalone tool for now:
https://github.com/jpirko/devlink
After this is merged into the kernel, I will include the "dl" or "devlink"
tool in the iproute2 toolset.

Port type setting example:
	myhost:~$ dl help
	Usage: dl [ OPTIONS ] OBJECT { COMMAND | help }
	where  OBJECT := { dev | port | monitor }
	       OPTIONS := { -v/--verbose }

	myhost:~$ dl dev help
	Usage: dl dev show [DEV]

	myhost:~$ dl dev show
	pci/0000:01:00.0

	myhost:~$ dl port help
	Usage: dl port show [DEV/PORT_INDEX]
	Usage: dl port set DEV/PORT_INDEX [ type { eth | ib | auto} ]
	Usage: dl port split DEV/PORT_INDEX count
	Usage: dl port unsplit DEV/PORT_INDEX

	myhost:~$ dl port show
	pci/0000:01:00.0/1: type ib ibdev mlx4_0
	pci/0000:01:00.0/2: type ib ibdev mlx4_0

	myhost:~$ sudo dl port set pci/0000:01:00.0/1 type eth

	myhost:~$ dl port show
	pci/0000:01:00.0/1: type eth netdev ens4
	pci/0000:01:00.0/2: type ib ibdev mlx4_0

	myhost:~$ sudo dl port set ens4 type auto

	myhost:~$ dl port show
	pci/0000:01:00.0/1: type eth(auto) netdev ens4
	pci/0000:01:00.0/2: type ib ibdev mlx4_0

Port splitting example:
	myswitch:~$ sudo modprobe mlxsw_pci
	myswitch:~$ dl port
	pci/0000:03:00.0/1: type eth netdev eth0
	pci/0000:03:00.0/3: type eth netdev eth1
	pci/0000:03:00.0/5: type eth netdev eth2
	...
	pci/0000:03:00.0/63: type eth netdev eth31

	myswitch:~$ sudo dl port split pci/0000:03:00.0/1 2   (or "sudo dl port split eth0 2")

	myswitch:~$ dl port
	pci/0000:03:00.0/3: type eth netdev eth1
	pci/0000:03:00.0/5: type eth netdev eth2
	...
	pci/0000:03:00.0/63: type eth netdev eth31
	pci/0000:03:00.0/1: type eth netdev eth0 split_group 16
	pci/0000:03:00.0/2: type eth netdev eth32 split_group 16

	myswitch:~$ sudo dl port unsplit pci/0000:03:00.0/1

	myswitch:~$ dl port
	pci/0000:03:00.0/3: type eth netdev eth1
	pci/0000:03:00.0/5: type eth netdev eth2
	pci/0000:03:00.0/63: type eth netdev eth31
	pci/0000:03:00.0/1: type eth netdev eth0
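
The port handles in the sessions above have the form BUS/DEV/PORT_INDEX (for example `pci/0000:01:00.0/1`). A small Python sketch of splitting such a handle follows; `parse_port_handle` is a hypothetical helper, not part of the "dl" tool, and it does not cover the netdev-name form (`ens4`, `eth0`) that the tool also accepts.

```python
def parse_port_handle(handle):
    """Split 'BUS/DEV/PORT_INDEX' into (bus, dev, port_index).

    Raises ValueError if the handle does not have exactly three
    '/'-separated parts or the port index is not an integer.
    """
    bus, dev, index = handle.split("/")
    return bus, dev, int(index)
```

The bus name and device name together identify the devlink instance, and the trailing integer selects the port within it.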

v2->v3:
patch 1/9
 -removed generated devlink index and name, use bus name and dev name as
  a handle for all userspace originated commands. Along with that,
  remove sysfs stub. Requested by Hannes Sowa.
patch 2/9
 -add dev param to devlink_register (api change)
patch 4/9
 -add dev param to devlink_register (api change)
patch 9/9
 -set port's speed according to width fix by Ido
v1->v2:
patch 1/9
 -removed no longer used "devlink_dev" helper
 -fix couple of typos and misspells
patch 4/9:
 -removed SET_NETDEV_DEV set to devlink dev
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Introduce port splitting · 18f1e70c

Committed by Ido Schimmel on Feb 26, 2016

Allow a user to split or unsplit a port using the newly introduced
devlink ops.

Once split, the original netdev is destroyed and 2 or 4 others are
created, according to user configuration. The new ports are like any
other port, with the sole difference of supporting a lower maximum
speed. When unsplit, the reverse process takes place.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Mark unused ports using NULL · a133318c

Committed by Ido Schimmel on Feb 26, 2016

When splitting and unsplitting we'll destroy usable ports on the fly, so
mark them using a NULL pointer to indicate that their local port number
is free and can be re-used.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Store local port to module mapping during init · 558c2d5e

Committed by Ido Schimmel on Feb 26, 2016

The port netdevs are each associated with a different local port number
in the device. These local ports are grouped into groups of 4 (e.g.
(1-4), (5-8)) called clusters. The ports in a cluster can be mapped to one
of two possible modules. This mapping is board-specific
and done by the device's firmware during init.

When splitting a port by 4, the device requires us to first unmap all
the ports in the cluster and then map each to a single lane in the module
associated with the port netdev used as the handle for the operation.
This means that two port netdevs will disappear, as only 100Gb/s (4
lanes) ports can be split and we are guaranteed to have two of these
((1, 3), (5, 7) etc.) in a cluster.

When unsplit occurs we need to reinstantiate the two original 100Gb/s
ports and map each to its original module. Therefore, during driver init
store the initial local port to module mapping, so it can be used later
during unsplitting.

Note that a by 2 split doesn't require us to store the mapping, as we
only need to reinstantiate one port whose module is known.
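
The cluster arithmetic described above can be sketched in Python. The grouping of local ports into (1-4), (5-8) comes from the text; the helper names and the assumption that the four ports of a cluster take lanes 0 to 3 in ascending order are illustrative only.

```python
def cluster_base(local_port):
    """Return the first local port of the cluster containing `local_port`.

    Local ports are 1-based and grouped in fours: (1-4), (5-8), ...
    """
    return ((local_port - 1) // 4) * 4 + 1

def split_by_4(local_port, module):
    """Plan a by-4 split: every port in the cluster gets one lane of `module`.

    Returns a list of (local_port, module, lane) tuples. The lane order is
    an assumption made for this sketch.
    """
    base = cluster_base(local_port)
    return [(base + lane, module, lane) for lane in range(4)]
```

Splitting the 100Gb/s port at local port 3 therefore touches local ports 1 to 4, which is why the other 100Gb/s port of the cluster (local port 1) disappears and must be restored from the saved mapping on unsplit.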
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Unmap local port from module during teardown · 3e9b27b8

Committed by Ido Schimmel on Feb 26, 2016

When splitting a port we replace it with 2 or 4 other ports. To be able
to do that we need to remove the original port netdev and unmap it from
its module. However, we first mark it as disabled, as active ports
cannot be unmapped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Add devlink port splitter callbacks · 284ef803

Committed by Jiri Pirko on Feb 26, 2016

Add middle layer in mlxsw core code to forward port split/unsplit calls
into specific ASIC drivers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Implement devlink interface · c4745500

Committed by Jiri Pirko on Feb 26, 2016

Implement newly introduced devlink interface. Add devlink port instances
for every port and set the port types accordingly.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: Implement port type setting via devlink interface · b2facd95

Committed by Jiri Pirko on Feb 26, 2016

So far, there has been an mlx4-specific sysfs file allowing the user to
change the port type to either Ethernet or InfiniBand. This is very
inconvenient.

Expose the same ability to set the port type in a generic way
using the devlink interface.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: Implement devlink interface · 09d4d087

Committed by Jiri Pirko on Feb 26, 2016

Implement newly introduced devlink interface. Add devlink port instances
for every port and set the port types accordingly.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
v2->v3:
-add dev param to devlink_register (api change)
Signed-off-by: David S. Miller <davem@davemloft.net>

Introduce devlink infrastructure · bfcd3a46

Committed by Jiri Pirko on Feb 26, 2016

Introduce devlink infrastructure for drivers to register and expose to
userspace via generic Netlink interface.

There are two basic objects defined:
devlink - one instance for every "parent device", for example switch ASIC
devlink port - one instance for every physical port of the device.

This initial portion implements basic get/dump of objects to userspace.
Also, port splitter and port type setting is implemented.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tc-sw-only' · bd070e21

Committed by David S. Miller on Mar 01, 2016

John Fastabend says:

====================
tc software only

This adds a software only flag to tc but incorporates a bunch of comments
from the original attempt at this.

First, instead of having the offload decision logic embedded in cls_u32,
I lifted it into pkt_cls.h so it can be used anywhere and named the flag
TCA_CLS_FLAGS_SKIP_HW (Thanks Jiri ;)

In order to do this I put the flag defines in pkt_cls.h as well. However
it was suggested that perhaps these flags could be lifted into the
upper layer of TCA_ as well but I'm afraid this can not be done with
existing tc design as far as I can tell. The problem is the filters are
packed and unpacked in the classifier specific code and pushing the flags
through the high level doesn't seem easily doable. And we already have
this design where classifiers handle generic options such as actions and
policers. So I think adding one more thing here is OK as 'tc' et al.
already know how to handle this type of thing.
====================
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: cls_u32 add bit to specify software only rules · 9e8ce79c

Committed by John Fastabend on Feb 26, 2016

In the initial implementation the only way to stop a rule from being
inserted into the hardware table was via the device feature flag.
However this doesn't work well when working on an end host system
where packets are expected to hit both the hardware and software
datapaths.

For example we can imagine a rule that will match an IP address and
increment a field. If we install this rule in both hardware and
software we may increment the field twice. To date we have only
added support for the drop action so we have been able to ignore
these cases. But as we extend the action support we will hit this
example plus more such cases. Arguably these are not even corner
cases; in many working systems these cases will be common.

To avoid forcing the driver to always abort (i.e. the above example)
this patch adds a flag to add a rule in software only. A careful
user can use this flag to build software and hardware datapaths
that work together. One example we have found particularly useful
is to use hardware resources to set the skb->mark on the skb when
the match may be expensive to run in software but a mark lookup
in a hash table is cheap. The idea here is hardware can do in one
lookup what the u32 classifier may need to traverse multiple lists
and hash tables to compute. The flag is only passed down on inserts.
On deletion to avoid stale references in hardware we always try
to remove a rule if it exists.

The flags field is part of the classifier specific options. Although
it is tempting to lift this into the generic structure, doing this
proves difficult due to how the tc netlink attributes are implemented
along with how the dump/change routines are called. There is also
precedent for putting seemingly generic pieces in the specific
classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So
although not ideal I've left FLAGS in the u32 options as well, as it
simplifies the code greatly and user space has already learned how
to manage these bits a la the 'tc' tool.

Another thing: when trying to update a rule, we require the flags to
be unchanged. This is to force user space, software u32 and
the hardware u32 to keep in sync. Thanks to Simon Horman for
catching this case.
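
The three rules stated in this message (the flag is honoured on insert, flags must not change on update, deletion always tries the hardware) can be sketched in Python. The dictionaries standing in for the software and hardware tables, the `Rule` class, the helper names and the flag's numeric value are all illustrative assumptions, not the kernel implementation.

```python
TCA_CLS_FLAGS_SKIP_HW = 1  # illustrative value for this sketch

class Rule:
    def __init__(self, handle, flags=0):
        self.handle = handle
        self.flags = flags

def insert_rule(sw_table, hw_table, rule):
    """Always install in software; install in hardware unless SKIP_HW is set."""
    sw_table[rule.handle] = rule
    if not (rule.flags & TCA_CLS_FLAGS_SKIP_HW):
        hw_table[rule.handle] = rule

def check_update(old_rule, new_flags):
    """Reject an update that changes the flags of an existing rule."""
    if new_flags != old_rule.flags:
        raise ValueError("flags must be unchanged when updating a rule")

def delete_rule(sw_table, hw_table, handle):
    """Remove from software and always try hardware, to avoid stale entries."""
    sw_table.pop(handle, None)
    hw_table.pop(handle, None)
```

Keeping the flags fixed across updates means a rule can never silently move between the software-only and offloaded states.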
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: cls_u32: move TC offload feature bit into cls_u32 offload logic · 2b6ab0d3

Committed by John Fastabend on Feb 26, 2016

In the original series drivers would get offload requests for cls_u32
rules even if the feature bit is disabled. This meant the driver had
to do a boilerplate check on the feature bit before adding/deleting
the rule.

This patch lifts the check into the core code and removes it from the
driver specific case.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: consolidate offload decision in cls_u32 · 6843e7a2

Committed by John Fastabend on Feb 26, 2016

The offload decision was originally very basic and tied to whether the dev
implemented the appropriate ndo op hook. The next step is to allow
the user to more flexibly define whether any particular rule should be
offloaded or not. In order to have this logic in one function, lift
the current check into a helper routine tc_should_offload().
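
A Python sketch of what such a single decision helper looks like once the whole series is applied: the ndo hook check from this commit, plus the feature bit and the software-only flag added by the two follow-up commits listed above. The `Dev` fields, the function name `should_offload` and the flag value are stand-ins chosen for illustration; the kernel helper is C and its exact signature is not shown in this message.

```python
from dataclasses import dataclass

TCA_CLS_FLAGS_SKIP_HW = 1  # illustrative value for this sketch

@dataclass
class Dev:
    has_offload_hook: bool     # stands in for "driver implements the ndo op"
    offload_feature_on: bool   # stands in for the TC offload feature bit

def should_offload(dev, flags=0):
    """Return True only if every condition for hardware offload holds."""
    if not dev.has_offload_hook:
        return False           # driver cannot offload at all
    if not dev.offload_feature_on:
        return False           # offload disabled via the device feature bit
    if flags & TCA_CLS_FLAGS_SKIP_HW:
        return False           # user asked for a software-only rule
    return True
```

Placing every condition in one function means a new classifier or a new condition only has to touch a single place.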
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ndo_set_rx_headroom' · d2e42a17

Committed by David S. Miller on Mar 01, 2016

Paolo Abeni says:

====================
bridge/ovs: avoid skb head copy on frame forwarding

Currently, when an OVS or Linux bridge is used to forward frames towards
some tunnel device, a skb_head_copy() may occur if the ingress device does not
provide enough headroom for the tx encapsulation.

This patch series tries to address the issue implementing a new ndo operation to
allow the master device to control the headroom used when allocating the skb on
frame reception.

Said operation is used by the Linux bridge to notify the bridged ports of
needed_headroom changes, and similar bookkeeping and behaviour is also added to
openvswitch, on a per datapath basis.

Finally, the operation is implemented for veth and tun devices, which gives
a performance improvement in the 6-12% range when forwarding frames from said
devices towards a vxlan tunnel.

v2:
- fix netdev_get_fwd_headroom() behaviour
- remove some code duplication with the netdev_set_rx_headroom() and
   netdev_reset_rx_headroom() helpers
- handle headroom reset on [v]port removal/deletion
- initialize tun align to the old default value

v3:
- fix a comment typo
====================
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

veth: implement ndo_set_rx_headroom · 163e5292

Committed by Paolo Abeni on Feb 26, 2016

The rx headroom for veth dev is the peer device needed_headroom.
Avoid ping-pong updates setting the private flag IFF_PHONY_HEADROOM.

This avoids skb head reallocation when forwarding from a veth dev
towards a device adding some kind of encapsulation.

When transmitting frames below the MTU size towards a vxlan device,
this gives about 10% performance speed-up when OVS is used to connect
the veth and the vxlan device and a little more when using a
plain Linux bridge.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tun: implement ndo_set_rx_headroom · eaea34b2

Committed by Paolo Abeni on Feb 26, 2016

ndo_set_rx_headroom controls the align value used by tun devices to
allocate skbs on frame reception.
When the xmit device adds a large encapsulation, this avoids an skb
head reallocation on forwarding.

The measured improvement when forwarding towards a vxlan dev with
frame size below the egress device MTU is as follows:

vxlan over ipv6, bridged: +6%
vxlan over ipv6, ovs: +7%

In case of ipv4 tunnels there is no improvement, since the tun
device default alignment provides enough headroom to avoid the skb
head reallocation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ovs: propagate per dp max headroom to all vports · 3a927bc7

Committed by Paolo Abeni on Feb 26, 2016

This patch implements bookkeeping support to compute the maximum
headroom for all the devices in each datapath. When said value
changes, the underlying devs are notified via the
ndo_set_rx_headroom method.

This also increases the internal vports xmit performance.
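
The bookkeeping can be sketched in Python as a per-datapath maximum that is recomputed whenever a port is added or removed, with every port notified when the maximum changes. The `Datapath` class, the `notify` callback (standing in for ndo_set_rx_headroom) and the choice to tell a newly added port the current maximum even when it did not change are assumptions of this sketch, not the openvswitch code.

```python
class Datapath:
    def __init__(self, notify):
        self.vports = {}          # port name -> needed_headroom
        self.max_headroom = 0
        self.notify = notify      # stands in for ndo_set_rx_headroom

    def _update(self):
        """Recompute the maximum; notify every port if it changed."""
        new_max = max(self.vports.values(), default=0)
        if new_max == self.max_headroom:
            return False
        self.max_headroom = new_max
        for name in self.vports:
            self.notify(name, new_max)
        return True

    def add_vport(self, name, needed_headroom):
        self.vports[name] = needed_headroom
        if not self._update():
            # Assumption: a port joining without changing the maximum
            # still needs to learn the current value.
            self.notify(name, self.max_headroom)

    def del_vport(self, name):
        del self.vports[name]
        self._update()
```

Adding a tunnel port with a large needed_headroom raises the maximum for every other port in the datapath, and removing it lowers the maximum again.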
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: notify enslaved devices of headroom changes · 45493d47

Committed by Paolo Abeni on Feb 26, 2016

On bridge needed_headroom changes, the enslaved devices are
notified via the ndo_set_rx_headroom method.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdev: introduce ndo_set_rx_headroom · 871b642a

Committed by Paolo Abeni on Feb 26, 2016

This method allows the controlling device (i.e. the bridge) to specify
additional headroom to be allocated for skb head on frame reception.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-next' · 46d5efa9

Committed by David S. Miller on Mar 01, 2016

Michael Chan says:

====================
bnxt_en: updates for net-next.

Miscellaneous updates covering SRIOV, IRQ coalescing, firmware logging and
package version for net-next.  Thanks.

v2: Updated description and added more comments for patch 1.  Fixed
function parameters formatting for patch 4.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add hwrm_send_message_silent(). · 90e20921

Committed by Michael Chan on Feb 26, 2016

This is used to send NVM_FIND_DIR_ENTRY messages which can return error
if the entry is not found.  This is normal and the error message will
cause unnecessary alarm, so silence it.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
