提交 aae3dbb4 编写于 作者: L Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...

要显示的变更太多。

To preserve performance only 1000 of 1000+ files are displayed.
* Adaptrum Anarion ethernet controller
This device is a platform glue layer for stmmac.
Please see stmmac.txt for the other unchanged properties.
Required properties:
- compatible: Should be "adaptrum,anarion-gmac", "snps,dwmac"
- phy-mode: Should be "rgmii". Other modes are not currently supported.
Examples:
gmac1: ethernet@f2014000 {
compatible = "adaptrum,anarion-gmac", "snps,dwmac";
reg = <0xf2014000 0x4000>, <0xf2018100 8>;
interrupt-parent = <&core_intc>;
interrupts = <21>;
interrupt-names = "macirq";
clocks = <&core_clk>;
clock-names = "stmmaceth";
phy-mode = "rgmii";
};
Broadcom Bluetooth Chips
---------------------
This documents the binding structure and common properties for serial
attached Broadcom devices.
Serial attached Broadcom devices shall be a child node of the host UART
device the slave device is attached to.
Required properties:
- compatible: should contain one of the following:
* "brcm,bcm43438-bt"
Optional properties:
- max-speed: see Documentation/devicetree/bindings/serial/slave-device.txt
- shutdown-gpios: GPIO specifier, used to enable the BT module
- device-wakeup-gpios: GPIO specifier, used to wakeup the controller
- host-wakeup-gpios: GPIO specifier, used to wakeup the host processor
- clocks: clock specifier if external clock provided to the controller
- clock-names: should be "extclk"
Example:
&uart2 {
pinctrl-names = "default";
pinctrl-0 = <&uart2_pins>;
bluetooth {
compatible = "brcm,bcm43438-bt";
max-speed = <921600>;
};
};
......@@ -41,6 +41,11 @@ Optional properties (port):
- marvell,loopback: port is loopback mode
- phy: a phandle to a phy node defining the PHY address (as the reg
property, a single integer).
- interrupt-names: if more than a single interrupt for rx is given, must
be the name associated to the interrupts listed. Valid
names are: "tx-cpu0", "tx-cpu1", "tx-cpu2", "tx-cpu3",
"rx-shared", "link".
- marvell,system-controller: a phandle to the system controller.
Example for marvell,armada-375-pp2:
......@@ -80,19 +85,37 @@ cpm_ethernet: ethernet@0 {
clock-names = "pp_clk", "gop_clk", "gp_clk";
eth0: eth0 {
interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
"tx-cpu3", "rx-shared";
port-id = <0>;
gop-port-id = <0>;
};
eth1: eth1 {
interrupts = <GIC_SPI 38 IRQ_TYPE_LEVEL_HIGH>;
interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
"tx-cpu3", "rx-shared";
port-id = <1>;
gop-port-id = <2>;
};
eth2: eth2 {
interrupts = <GIC_SPI 39 IRQ_TYPE_LEVEL_HIGH>;
interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
"tx-cpu3", "rx-shared";
port-id = <2>;
gop-port-id = <3>;
};
......
......@@ -7,24 +7,30 @@ have dual GMAC each represented by a child node..
* Ethernet controller node
Required properties:
- compatible: Should be "mediatek,mt2701-eth"
- compatible: Should be
"mediatek,mt2701-eth": for MT2701 SoC
"mediatek,mt7623-eth", "mediatek,mt2701-eth": for MT7623 SoC
"mediatek,mt7622-eth": for MT7622 SoC
- reg: Address and length of the register set for the device
- interrupts: Should contain the three frame engines interrupts in numeric
order. These are fe_int0, fe_int1 and fe_int2.
- clocks: the clock used by the core
- clock-names: the names of the clock listed in the clocks property. These are
"ethif", "esw", "gp2", "gp1"
"ethif", "esw", "gp2", "gp1" : For MT2701 and MT7623 SoC
"ethif", "esw", "gp0", "gp1", "gp2", "sgmii_tx250m", "sgmii_rx250m",
"sgmii_cdr_ref", "sgmii_cdr_fb", "sgmii_ck", "eth2pll" : For MT7622 SoC
- power-domains: phandle to the power domain that the ethernet is part of
- resets: Should contain a phandle to the ethsys reset signal
- reset-names: Should contain the reset signal name "eth"
- mediatek,ethsys: phandle to the syscon node that handles the port setup
- mediatek,sgmiisys: phandle to the syscon node that handles the SGMII setup
which is required for those SoCs equipped with SGMII such as MT7622 SoC.
- mediatek,pctl: phandle to the syscon node that handles the ports slew rate
and driver current
Optional properties:
- interrupt-parent: Should be the phandle for the interrupt controller
that services interrupts for this device
* Ethernet MAC node
Required properties:
......
......@@ -52,6 +52,11 @@ Optional Properties:
Mark the corresponding energy efficient ethernet mode as broken and
request the ethernet to stop advertising it.
- phy-is-integrated: If set, indicates that the PHY is integrated into the same
physical package as the Ethernet MAC. If needed, muxers should be configured
to ensure the integrated PHY is used. The absence of this property indicates
the muxers should be configured so that the external PHY is used.
Example:
ethernet-phy@0 {
......
......@@ -4,19 +4,25 @@ This file provides information on what the device node for the Ethernet AVB
interface contains.
Required properties:
- compatible: "renesas,etheravb-r8a7790" if the device is a part of R8A7790 SoC.
"renesas,etheravb-r8a7791" if the device is a part of R8A7791 SoC.
"renesas,etheravb-r8a7792" if the device is a part of R8A7792 SoC.
"renesas,etheravb-r8a7793" if the device is a part of R8A7793 SoC.
"renesas,etheravb-r8a7794" if the device is a part of R8A7794 SoC.
"renesas,etheravb-r8a7795" if the device is a part of R8A7795 SoC.
"renesas,etheravb-r8a7796" if the device is a part of R8A7796 SoC.
"renesas,etheravb-rcar-gen2" for generic R-Car Gen 2 compatible interface.
"renesas,etheravb-rcar-gen3" for generic R-Car Gen 3 compatible interface.
- compatible: Must contain one or more of the following:
- "renesas,etheravb-r8a7743" for the R8A7743 SoC.
- "renesas,etheravb-r8a7745" for the R8A7745 SoC.
- "renesas,etheravb-r8a7790" for the R8A7790 SoC.
- "renesas,etheravb-r8a7791" for the R8A7791 SoC.
- "renesas,etheravb-r8a7792" for the R8A7792 SoC.
- "renesas,etheravb-r8a7793" for the R8A7793 SoC.
- "renesas,etheravb-r8a7794" for the R8A7794 SoC.
- "renesas,etheravb-rcar-gen2" as a fallback for the above
R-Car Gen2 and RZ/G1 devices.
When compatible with the generic version, nodes must list the
SoC-specific version corresponding to the platform first
followed by the generic version.
- "renesas,etheravb-r8a7795" for the R8A7795 SoC.
- "renesas,etheravb-r8a7796" for the R8A7796 SoC.
- "renesas,etheravb-rcar-gen3" as a fallback for the above
R-Car Gen3 devices.
When compatible with the generic version, nodes must list the
SoC-specific version corresponding to the platform first followed by
the generic version.
- reg: offset and length of (1) the register block and (2) the stream buffer.
- interrupts: A list of interrupt-specifiers, one for each entry in
......
......@@ -10,6 +10,7 @@ Required properties:
"rockchip,rk3366-gmac": found on RK3366 SoCs
"rockchip,rk3368-gmac": found on RK3368 SoCs
"rockchip,rk3399-gmac": found on RK3399 SoCs
"rockchip,rv1108-gmac": found on RV1108 SoCs
- reg: addresses and length of the register sets for the device.
- interrupts: Should contain the GMAC interrupts.
- interrupt-names: Should contain the interrupt names "macirq".
......
XILINX AXI ETHERNET Device Tree Bindings
--------------------------------------------------------
Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core
provides connectivity to an external ethernet PHY supporting different
interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two
segments of memory for buffering TX and RX, as well as the capability of
offloading TX/RX checksum calculation off the processor.
Management configuration is done through the AXI interface, while payload is
sent and received through means of an AXI DMA controller. This driver
includes the DMA driver code, so this driver is incompatible with AXI DMA
driver.
For more details about mdio please refer phy.txt file in the same directory.
Required properties:
- compatible : Must be one of "xlnx,axi-ethernet-1.00.a",
"xlnx,axi-ethernet-1.01.a", "xlnx,axi-ethernet-2.01.a"
- reg : Address and length of the IO space.
- interrupts : Should be a list of two interrupt, TX and RX.
- phy-handle : Should point to the external phy device.
See ethernet.txt file in the same directory.
- xlnx,rxmem : Set to allocated memory buffer for Rx/Tx in the hardware
Optional properties:
- phy-mode : See ethernet.txt
- xlnx,phy-type : Deprecated, do not use, but still accepted in preference
to phy-mode.
- xlnx,txcsum : 0 or empty for disabling TX checksum offload,
1 to enable partial TX checksum offload,
2 to enable full TX checksum offload
- xlnx,rxcsum : Same values as xlnx,txcsum but for RX checksum offload
Example:
axi_ethernet_eth: ethernet@40c00000 {
compatible = "xlnx,axi-ethernet-1.00.a";
device_type = "network";
interrupt-parent = <&microblaze_0_axi_intc>;
interrupts = <2 0>;
phy-mode = "mii";
reg = <0x40c00000 0x40000>;
xlnx,rxcsum = <0x2>;
xlnx,rxmem = <0x800>;
xlnx,txcsum = <0x2>;
phy-handle = <&phy0>;
axi_ethernetlite_0_mdio: mdio {
#address-cells = <1>;
#size-cells = <0>;
phy0: phy@0 {
device_type = "ethernet-phy";
reg = <1>;
};
};
};
mvebu comphy driver
-------------------
A comphy controller can be found on Marvell Armada 7k/8k on the CP110. It
provides a number of shared PHYs used by various interfaces (network, sata,
usb, PCIe...).
Required properties:
- compatible: should be "marvell,comphy-cp110"
- reg: should contain the comphy register location and length.
- marvell,system-controller: should contain a phandle to the
system controller node.
- #address-cells: should be 1.
- #size-cells: should be 0.
A sub-node is required for each comphy lane provided by the comphy.
Required properties (child nodes):
- reg: comphy lane number.
- #phy-cells : from the generic phy bindings, must be 1. Defines the
input port to use for a given comphy lane.
Example:
cpm_comphy: phy@120000 {
compatible = "marvell,comphy-cp110";
reg = <0x120000 0x6000>;
marvell,system-controller = <&cpm_syscon0>;
#address-cells = <1>;
#size-cells = <0>;
cpm_comphy0: phy@0 {
reg = <0>;
#phy-cells = <1>;
};
cpm_comphy1: phy@1 {
reg = <1>;
#phy-cells = <1>;
};
};
......@@ -30,8 +30,6 @@ atm.txt
- info on where to get ATM programs and support for Linux.
ax25.txt
- info on using AX.25 and NET/ROM code for Linux
batman-adv.txt
- B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames.
baycom.txt
- info on the driver for Baycom style amateur radio modems
bonding.txt
......
==========
batman-adv
==========
Batman advanced is a new approach to wireless networking which does no longer
operate on the IP basis. Unlike the batman daemon, which exchanges information
using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI
Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It
emulates a virtual network switch of all nodes participating. Therefore all
nodes appear to be link local, thus all higher operating protocols won't be
affected by any changes within the network. You can run almost any protocol
above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX.
Batman advanced was implemented as a Linux kernel driver to reduce the overhead
to a minimum. It does not depend on any (other) network driver, and can be used
on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style
layer 2).
Configuration
=============
Load the batman-adv module into your kernel::
$ insmod batman-adv.ko
The module is now waiting for activation. You must add some interfaces on which
batman can operate. After loading the module batman advanced will scan your
systems interfaces to search for compatible interfaces. Once found, it will
create subfolders in the ``/sys`` directories of each supported interface,
e.g.::
$ ls /sys/class/net/eth0/batman_adv/
elp_interval iface_status mesh_iface throughput_override
If an interface does not have the ``batman_adv`` subfolder, it probably is not
supported. Not supported interfaces are: loopback, non-ethernet and batman's
own interfaces.
Note: After the module was loaded it will continuously watch for new
interfaces to verify the compatibility. There is no need to reload the module
if you plug your USB wifi adapter into your machine after batman advanced was
initially loaded.
The batman-adv soft-interface can be created using the iproute2 tool ``ip``::
$ ip link add name bat0 type batadv
To activate a given interface simply attach it to the ``bat0`` interface::
$ ip link set dev eth0 master bat0
Repeat this step for all interfaces you wish to add. Now batman starts
using/broadcasting on this/these interface(s).
By reading the "iface_status" file you can check its status::
$ cat /sys/class/net/eth0/batman_adv/iface_status
active
To deactivate an interface you have to detach it from the "bat0" interface::
$ ip link set dev eth0 nomaster
All mesh wide settings can be found in batman's own interface folder::
$ ls /sys/class/net/bat0/mesh/
aggregated_ogms fragmentation isolation_mark routing_algo
ap_isolation gw_bandwidth log_level vlan0
bonding gw_mode multicast_mode
bridge_loop_avoidance gw_sel_class network_coding
distributed_arp_table hop_penalty orig_interval
There is a special folder for debugging information::
$ ls /sys/kernel/debug/batman_adv/bat0/
bla_backbone_table log neighbors transtable_local
bla_claim_table mcast_flags originators
dat_cache nc socket
gateways nc_nodes transtable_global
Some of the files contain all sort of status information regarding the mesh
network. For example, you can view the table of originators (mesh
participants) with::
$ cat /sys/kernel/debug/batman_adv/bat0/originators
Other files allow to change batman's behaviour to better fit your requirements.
For instance, you can check the current originator interval (value in
milliseconds which determines how often batman sends its broadcast packets)::
$ cat /sys/class/net/bat0/mesh/orig_interval
1000
and also change its value::
$ echo 3000 > /sys/class/net/bat0/mesh/orig_interval
In very mobile scenarios, you might want to adjust the originator interval to a
lower value. This will make the mesh more responsive to topology changes, but
will also increase the overhead.
Usage
=====
To make use of your newly created mesh, batman advanced provides a new
interface "bat0" which you should use from this point on. All interfaces added
to batman advanced are not relevant any longer because batman handles them for
you. Basically, one "hands over" the data by using the batman interface and
batman will make sure it reaches its destination.
The "bat0" interface can be used like any other regular interface. It needs an
IP address which can be either statically configured or dynamically (by using
DHCP or similar services)::
NodeA: ip link set up dev bat0
NodeA: ip addr add 192.168.0.1/24 dev bat0
NodeB: ip link set up dev bat0
NodeB: ip addr add 192.168.0.2/24 dev bat0
NodeB: ping 192.168.0.1
Note: In order to avoid problems remove all IP addresses previously assigned to
interfaces now used by batman advanced, e.g.::
$ ip addr flush dev eth0
Logging/Debugging
=================
All error messages, warnings and information messages are sent to the kernel
log. Depending on your operating system distribution this can be read in one of
a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in
the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages
are prefixed with "batman-adv:" So to see just these messages try::
$ dmesg | grep batman-adv
When investigating problems with your mesh network, it is sometimes necessary to
see more detail debug messages. This must be enabled when compiling the
batman-adv module. When building batman-adv as part of kernel, use "make
menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
(``CONFIG_BATMAN_ADV_DEBUG=y``).
Those additional debug messages can be accessed using a special file in
debugfs::
$ cat /sys/kernel/debug/batman_adv/bat0/log
The additional debug output is by default disabled. It can be enabled during
run time. Following log_levels are defined:
.. flat-table::
* - 0
- All debug output disabled
* - 1
- Enable messages related to routing / flooding / broadcasting
* - 2
- Enable messages related to route added / changed / deleted
* - 4
- Enable messages related to translation table operations
* - 8
- Enable messages related to bridge loop avoidance
* - 16
- Enable messages related to DAT, ARP snooping and parsing
* - 32
- Enable messages related to network coding
* - 64
- Enable messages related to multicast
* - 128
- Enable messages related to throughput meter
* - 255
- Enable all messages
The debug output can be changed at runtime using the file
``/sys/class/net/bat0/mesh/log_level``. e.g.::
$ echo 6 > /sys/class/net/bat0/mesh/log_level
will enable debug messages for when routes change.
Counters for different types of packets entering and leaving the batman-adv
module are available through ethtool::
$ ethtool --statistics bat0
batctl
======
As batman advanced operates on layer 2, all hosts participating in the virtual
switch are completely transparent for all protocols above layer 2. Therefore
the common diagnosis tools do not work as expected. To overcome these problems,
batctl was created. At the moment the batctl contains ping, traceroute, tcpdump
and interfaces to the kernel module settings.
For more information, please see the manpage (``man batctl``).
batctl is available on https://www.open-mesh.org/
Contact
=======
Please send us comments, experiences, questions, anything :)
IRC:
#batman on irc.freenode.org
Mailing-list:
b.a.t.m.a.n@open-mesh.org (optional subscription at
https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
You can also contact the Authors:
* Marek Lindner <mareklindner@neomailbox.ch>
* Simon Wunderlich <sw@simonwunderlich.de>
BATMAN-ADV
----------
Batman advanced is a new approach to wireless networking which
does no longer operate on the IP basis. Unlike the batman daemon,
which exchanges information using UDP packets and sets routing
tables, batman-advanced operates on ISO/OSI Layer 2 only and uses
and routes (or better: bridges) Ethernet Frames. It emulates a
virtual network switch of all nodes participating. Therefore all
nodes appear to be link local, thus all higher operating proto-
cols won't be affected by any changes within the network. You can
run almost any protocol above batman advanced, prominent examples
are: IPv4, IPv6, DHCP, IPX.
Batman advanced was implemented as a Linux kernel driver to re-
duce the overhead to a minimum. It does not depend on any (other)
network driver, and can be used on wifi as well as ethernet lan,
vpn, etc ... (anything with ethernet-style layer 2).
CONFIGURATION
-------------
Load the batman-adv module into your kernel:
# insmod batman-adv.ko
The module is now waiting for activation. You must add some in-
terfaces on which batman can operate. After loading the module
batman advanced will scan your systems interfaces to search for
compatible interfaces. Once found, it will create subfolders in
the /sys directories of each supported interface, e.g.
# ls /sys/class/net/eth0/batman_adv/
# elp_interval iface_status mesh_iface throughput_override
If an interface does not have the "batman_adv" subfolder it prob-
ably is not supported. Not supported interfaces are: loopback,
non-ethernet and batman's own interfaces.
Note: After the module was loaded it will continuously watch for
new interfaces to verify the compatibility. There is no need to
reload the module if you plug your USB wifi adapter into your ma-
chine after batman advanced was initially loaded.
The batman-adv soft-interface can be created using the iproute2
tool "ip"
# ip link add name bat0 type batadv
To activate a given interface simply attach it to the "bat0"
interface
# ip link set dev eth0 master bat0
Repeat this step for all interfaces you wish to add. Now batman
starts using/broadcasting on this/these interface(s).
By reading the "iface_status" file you can check its status:
# cat /sys/class/net/eth0/batman_adv/iface_status
# active
To deactivate an interface you have to detach it from the
"bat0" interface:
# ip link set dev eth0 nomaster
All mesh wide settings can be found in batman's own interface
folder:
# ls /sys/class/net/bat0/mesh/
# aggregated_ogms fragmentation isolation_mark routing_algo
# ap_isolation gw_bandwidth log_level vlan0
# bonding gw_mode multicast_mode
# bridge_loop_avoidance gw_sel_class network_coding
# distributed_arp_table hop_penalty orig_interval
There is a special folder for debugging information:
# ls /sys/kernel/debug/batman_adv/bat0/
# bla_backbone_table log neighbors transtable_local
# bla_claim_table mcast_flags originators
# dat_cache nc socket
# gateways nc_nodes transtable_global
Some of the files contain all sort of status information regard-
ing the mesh network. For example, you can view the table of
originators (mesh participants) with:
# cat /sys/kernel/debug/batman_adv/bat0/originators
Other files allow to change batman's behaviour to better fit your
requirements. For instance, you can check the current originator
interval (value in milliseconds which determines how often batman
sends its broadcast packets):
# cat /sys/class/net/bat0/mesh/orig_interval
# 1000
and also change its value:
# echo 3000 > /sys/class/net/bat0/mesh/orig_interval
In very mobile scenarios, you might want to adjust the originator
interval to a lower value. This will make the mesh more respon-
sive to topology changes, but will also increase the overhead.
USAGE
-----
To make use of your newly created mesh, batman advanced provides
a new interface "bat0" which you should use from this point on.
All interfaces added to batman advanced are not relevant any
longer because batman handles them for you. Basically, one "hands
over" the data by using the batman interface and batman will make
sure it reaches its destination.
The "bat0" interface can be used like any other regular inter-
face. It needs an IP address which can be either statically con-
figured or dynamically (by using DHCP or similar services):
# NodeA: ip link set up dev bat0
# NodeA: ip addr add 192.168.0.1/24 dev bat0
# NodeB: ip link set up dev bat0
# NodeB: ip addr add 192.168.0.2/24 dev bat0
# NodeB: ping 192.168.0.1
Note: In order to avoid problems remove all IP addresses previ-
ously assigned to interfaces now used by batman advanced, e.g.
# ip addr flush dev eth0
LOGGING/DEBUGGING
-----------------
All error messages, warnings and information messages are sent to
the kernel log. Depending on your operating system distribution
this can be read in one of a number of ways. Try using the com-
mands: dmesg, logread, or looking in the files /var/log/kern.log
or /var/log/syslog. All batman-adv messages are prefixed with
"batman-adv:" So to see just these messages try
# dmesg | grep batman-adv
When investigating problems with your mesh network it is some-
times necessary to see more detail debug messages. This must be
enabled when compiling the batman-adv module. When building bat-
man-adv as part of kernel, use "make menuconfig" and enable the
option "B.A.T.M.A.N. debugging".
Those additional debug messages can be accessed using a special
file in debugfs
# cat /sys/kernel/debug/batman_adv/bat0/log
The additional debug output is by default disabled. It can be en-
abled during run time. Following log_levels are defined:
0 - All debug output disabled
1 - Enable messages related to routing / flooding / broadcasting
2 - Enable messages related to route added / changed / deleted
4 - Enable messages related to translation table operations
8 - Enable messages related to bridge loop avoidance
16 - Enable messages related to DAT, ARP snooping and parsing
32 - Enable messages related to network coding
64 - Enable messages related to multicast
128 - Enable messages related to throughput meter
255 - Enable all messages
The debug output can be changed at runtime using the file
/sys/class/net/bat0/mesh/log_level. e.g.
# echo 6 > /sys/class/net/bat0/mesh/log_level
will enable debug messages for when routes change.
Counters for different types of packets entering and leaving the
batman-adv module are available through ethtool:
# ethtool --statistics bat0
BATCTL
------
As batman advanced operates on layer 2 all hosts participating in
the virtual switch are completely transparent for all protocols
above layer 2. Therefore the common diagnosis tools do not work
as expected. To overcome these problems batctl was created. At
the moment the batctl contains ping, traceroute, tcpdump and
interfaces to the kernel module settings.
For more information, please see the manpage (man batctl).
batctl is available on https://www.open-mesh.org/
CONTACT
-------
Please send us comments, experiences, questions, anything :)
IRC: #batman on irc.freenode.org
Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription
at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
You can also contact the Authors:
Marek Lindner <mareklindner@neomailbox.ch>
Simon Wunderlich <sw@simonwunderlich.de>
......@@ -13,6 +13,7 @@ Contents
- Configuring DPAA Ethernet in your kernel
- DPAA Ethernet Frame Processing
- DPAA Ethernet Features
- DPAA IRQ Affinity and Receive Side Scaling
- Debugging
DPAA Ethernet Overview
......@@ -147,7 +148,10 @@ gradually.
The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx
checksum offload feature is enabled by default and cannot be controlled through
ethtool.
ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS
provides a big performance boost for the forwarding scenarios, allowing
different traffic flows received by one interface to be processed by different
CPUs in parallel.
The driver has support for multiple prioritized Tx traffic classes. Priorities
range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with
......@@ -166,6 +170,68 @@ classes as follows:
tc qdisc add dev <int> root handle 1: \
mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1
DPAA IRQ Affinity and Receive Side Scaling
==========================================
Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation
queues is seen by the CPU as ingress traffic on a certain portal.
The DPAA QMan portal interrupts are affined each to a certain CPU.
The same portal interrupt services all the QMan portal consumers.
By default the DPAA Ethernet driver enables RSS, making use of the
DPAA FMan Parser and Keygen blocks to distribute traffic on 128
hardware frame queues using a hash on IP v4/v6 source and destination
and L4 source and destination ports, in present in the received frame.
When RSS is disabled, all traffic received by a certain interface is
received on the default Rx frame queue. The default DPAA Rx frame
queues are configured to put the received traffic into a pool channel
that allows any available CPU portal to dequeue the ingress traffic.
The default frame queues have the HOLDACTIVE option set, ensuring that
traffic bursts from a certain queue are serviced by the same CPU.
This ensures a very low rate of frame reordering. A drawback of this
is that only one CPU at a time can service the traffic received by a
certain interface when RSS is not enabled.
To implement RSS, the DPAA Ethernet driver allocates an extra set of
128 Rx frame queues that are configured to dedicated channels, in a
round-robin manner. The mapping of the frame queues to CPUs is now
hardcoded, there is no indirection table to move traffic for a certain
FQ (hash result) to another CPU. The ingress traffic arriving on one
of these frame queues will arrive at the same portal and will always
be processed by the same CPU. This ensures intra-flow order preservation
and workload distribution for multiple traffic flows.
RSS can be turned off for a certain interface using ethtool, i.e.
# ethtool -N fm1-mac9 rx-flow-hash tcp4 ""
To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6:
# ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn
There is no independent control for individual protocols, any command
run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is
going to control the rx-flow-hashing for all protocols on that interface.
Besides using the FMan Keygen computed hash for spreading traffic on the
128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when
the NETIF_F_RXHASH feature is on (active by default). This can be turned
on or off through ethtool, i.e.:
# ethtool -K fm1-mac9 rx-hashing off
# ethtool -k fm1-mac9 | grep hash
receive-hashing: off
# ethtool -K fm1-mac9 rx-hashing on
Actual changes:
receive-hashing: on
# ethtool -k fm1-mac9 | grep hash
receive-hashing: on
Please note that Rx hashing depends upon the rx-flow-hashing being on
for that interface - turning off rx-flow-hashing will also disable the
rx-hashing (without ethtool reporting it as off as that depends on the
NETIF_F_RXHASH feature flag).
Debugging
=========
......
......@@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most 32-bit
architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
compilation from eBPF instruction set.
architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64, arm32 perform
JIT compilation from eBPF instruction set.
Some core changes of the new internal format:
......@@ -793,7 +793,7 @@ Some core changes of the new internal format:
bpf_exit
After the call the registers R1-R5 contain junk values and cannot be read.
In the future an eBPF verifier can be used to validate internal BPF programs.
An in-kernel eBPF verifier is used to validate internal BPF programs.
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
......@@ -906,6 +906,10 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
BPF_CALL 0x80 /* eBPF only: function call */
BPF_EXIT 0x90 /* eBPF only: function return */
BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
BPF_JSLT 0xc0 /* eBPF only: signed '<' */
BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
......@@ -1017,7 +1021,7 @@ At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
......@@ -1039,7 +1043,7 @@ is a correct program. If there was R1 instead of R6, it would have
been rejected.
load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example:
bpf_mov R1 = 1
bpf_mov R2 = 2
......@@ -1058,7 +1062,7 @@ intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=FRAME_PTR, then access should be aligned and be within
If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.
......@@ -1069,7 +1073,7 @@ For example:
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is invalid program.
Though R10 is correct read-only register and has type FRAME_PTR
Though R10 is correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.
Pointer register spill/fill is tracked as well, since four (R6-R9)
......@@ -1094,6 +1098,71 @@ all use cases.
See details of eBPF verifier in kernel/bpf/verifier.c
Register value tracking
-----------------------
In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with 'struct bpf_reg_state', defined in include/linux/
bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
register state has a type, which is either NOT_INIT (the register has not been
written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
pointer type. The types of pointers describe their base, as follows:
PTR_TO_CTX Pointer to bpf_context.
CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic
on these pointers is forbidden.
PTR_TO_MAP_VALUE Pointer to the value stored in a map element.
PTR_TO_MAP_VALUE_OR_NULL
Either a pointer to a map value, or NULL; map accesses
(see section 'eBPF maps', below) return this type,
which becomes a PTR_TO_MAP_VALUE when checked != NULL.
Arithmetic on these pointers is forbidden.
PTR_TO_STACK Frame pointer.
PTR_TO_PACKET skb->data.
PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.
The verifier's knowledge about the variable offset consists of:
* minimum and maximum values as unsigned
* minimum and maximum values as signed
* knowledge of the values of individual bits, in the form of a 'tnum': a u64
'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
mask and value; no bit should ever be 1 in both. For example, if a byte is read
into a register from memory, the register's top 56 bits are known zero, while
the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
then OR this with 0x40, we get (0x40; 0xcf), then if we add 1 we get (0x0;
0x1ff), because of potential carries.
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding some variable to a packet pointer, if you then copy it to
another register and (say) add a constant 4, both registers will share the same
'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
found to be less than a PTR_TO_PACKET_END, the other register is now known to
have a safe range of at least 4 bytes. See 'Direct packet access', below, for
more on PTR_TO_PACKET ranges.
The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
......@@ -1121,7 +1190,7 @@ it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.
More complex packet access may look like:
R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
......@@ -1135,26 +1204,31 @@ More complex packet access may look like:
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp
R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
offset within a packet and since the program author did
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add' operation on packet registers. Any other
operation will set the register state to 'unknown_value' and it won't be
The verifier only allows 'add'/'sub' operations on packet registers. Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation 'r3 += rX' may overflow and become less than original skb->data,
therefore the verifier has to prevent that. So it tracks the number of
upper zero bits in all 'uknown_value' registers, so when it sees
'r3 += rX' instruction and rX is more than 16-bit value, it will error as:
"cannot add integer value with N upper zero bits to ptr_to_packet"
therefore the verifier has to prevent that. So when it sees 'r3 += rX'
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
against skb->data_end will not give us 'range' information, so attempts to read
through the pointer will give "invalid access to packet" error.
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
R4=inv56 which means that upper 56 bits on the register are guaranteed
to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since
multiplying 8-bit value by constant 14 will keep upper 52 bits as zero.
Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign
extending. This logic is implemented in evaluate_reg_alu() function.
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
of the register are guaranteed to be zero, and nothing is known about the lower
8 bits. After insn 'r4 *= 14' the state becomes
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
value by constant 14 will keep upper 52 bits as zero, also the least significant
bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
extending. This logic is implemented in adjust_reg_min_max_vals() function,
which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
versa) and adjust_scalar_min_max_vals() for operations on two scalars.
The end result is that bpf program author can access packet directly
using normal C code as:
......@@ -1214,6 +1288,22 @@ The map is defined by:
. key size in bytes
. value size in bytes
Pruning
-------
The verifier does not actually walk all possible paths through the program. For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction. If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well. For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe. The implementation is in the function regsafe().
Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().
Understanding eBPF verifier messages
------------------------------------
......
Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family
============================================================
Overview:
=========
HiNIC is a network interface card for the Data Center Area.
The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.).
The driver supports also a negotiated and extendable feature set.
Some HiNIC devices support SR-IOV. This driver is used for Physical Function
(PF).
HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and
adaptive interrupt moderation.
HiNIC devices support also various offload features such as checksum offload,
TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and
LRO(Large Receive Offload).
Supported PCI vendor ID/device IDs:
===================================
19e5:1822 - HiNIC PF
Driver Architecture and Source Code:
====================================
hinic_dev - Implement a Logical Network device that is independent from
specific HW details about HW data structure formats.
hinic_hwdev - Implement the HW details of the device and include the components
for accessing the PCI NIC.
hinic_hwdev contains the following components:
===============================================
HW Interface:
=============
The interface for accessing the pci device (DMA memory and PCI BARs).
(hinic_hw_if.c, hinic_hw_if.h)
Configuration Status Registers Area that describes the HW Registers on the
configuration and status BAR0. (hinic_hw_csr.h)
MGMT components:
================
Asynchronous Event Queues(AEQs) - The event queues for receiving messages from
the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h)
Application Programmable Interface commands(API CMD) - Interface for sending
MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h)
Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT
commands to the card and receives notifications from the MGMT modules on the
card by AEQs. Also set the addresses of the IO CMDQs in HW.
(hinic_hw_mgmt.c, hinic_hw_mgmt.h)
IO components:
==============
Completion Event Queues(CEQs) - The completion Event Queues that describe IO
tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h)
Work Queues(WQ) - Contain the memory and operations for use by CMD queues and
the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains
pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs).
(hinic_hw_wq.c, hinic_hw_wq.h)
Command Queues(CMDQ) - The queues for sending commands for IO management and is
used to set the QPs addresses in HW. The commands completion events are
accumulated on the CEQ that is configured to receive the CMDQ completion events.
(hinic_hw_cmdq.c, hinic_hw_cmdq.h)
Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting
Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h)
IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h)
HW device:
==========
HW device - de/constructs the HW Interface, the MGMT components on the
initialization of the driver and the IO components on the case of Interface
UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h)
hinic_dev contains the following components:
===============================================
PCI ID table - Contains the supported PCI Vendor/Device IDs.
(hinic_pci_tbl.h)
Port Commands - Send commands to the HW device for port management
(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h)
Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit.
The Logical Tx queue is not dependent on the format of the HW Send Queue.
(hinic_tx.c, hinic_tx.h)
Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive.
The Logical Rx queue is not dependent on the format of the HW Receive Queue.
(hinic_rx.c, hinic_rx.h)
hinic_dev - de/constructs the Logical Tx and Rx Queues.
(hinic_main.c, hinic_dev.h)
Miscellaneous:
=============
Common functions that are used by HW and Logical Device.
(hinic_common.c, hinic_common.h)
Support
=======
If an issue is identified with the released source code on the supported kernel
with a supported adapter, email the specific information related to the issue to
aviad.krawczyk@huawei.com.
......@@ -6,6 +6,7 @@ Contents:
.. toctree::
:maxdepth: 2
batman-adv
kapi
z8530book
......
......@@ -109,7 +109,10 @@ neigh/default/unres_qlen_bytes - INTEGER
queued for each unresolved address by other network layers.
(added in linux 3.3)
Setting negative value is meaningless and will return error.
Default: 65536 Bytes(64KB)
Default: SK_WMEM_MAX, (same as net.core.wmem_default).
Exact value depends on architecture and kernel options,
but should be enough to allow queuing 256 packets
of medium size.
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
......@@ -119,7 +122,7 @@ neigh/default/unres_qlen - INTEGER
unexpected packet loss. The current default value is calculated
according to default value of unres_qlen_bytes and true size of
packet.
Default: 31
Default: 101
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
......@@ -353,12 +356,7 @@ tcp_l3mdev_accept - BOOLEAN
compiled with CONFIG_NET_L3_MASTER_DEV.
tcp_low_latency - BOOLEAN
If set, the TCP stack makes decisions that prefer lower
latency as opposed to higher throughput. By default, this
option is not set meaning that higher throughput is preferred.
An example of an application where this default should be
changed would be a Beowulf compute cluster.
Default: 0
This is a legacy option, it has no effect anymore.
tcp_max_orphans - INTEGER
Maximal number of TCP sockets not attached to any user file handle,
......@@ -1291,8 +1289,7 @@ tag - INTEGER
xfrm4_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv4
destination cache entries. At twice this value the system will
refuse new allocations. The value must be set below the flowcache
limit (4096 * number of online cpus) to take effect.
refuse new allocations.
igmp_link_local_mcast_reports - BOOLEAN
Enable IGMP reports for link local multicast groups in the
......@@ -1356,6 +1353,15 @@ flowlabel_state_ranges - BOOLEAN
FALSE: disabled
Default: true
flowlabel_reflect - BOOLEAN
Automatically reflect the flow label. Needed for Path MTU
Discovery to work with Equal Cost Multipath Routing in anycast
environments. See RFC 7690 and:
https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
TRUE: enabled
FALSE: disabled
Default: FALSE
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
......@@ -1778,8 +1784,7 @@ ratelimit - INTEGER
xfrm6_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv6
destination cache entries. At twice this value the system will
refuse new allocations. The value must be set below the flowcache
limit (4096 * number of online cpus) to take effect.
refuse new allocations.
IPv6 Update by:
......
============
MSG_ZEROCOPY
============
Intro
=====
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.
Opportunity and Caveats
-----------------------
Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.
Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.
Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.
The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.
More Info
---------
Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.
paper, slides, video
https://netdevconf.org/2.1/session.html?debruijn
LWN article
https://lwn.net/Articles/726917/
patchset
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
=========
Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.
Socket Setup
------------
The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:
::
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
error(1, errno, "setsockopt zerocopy");
Transmission
------------
The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.
::
ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.
Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.
Notifications
-------------
The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.
The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
Notification Reception
~~~~~~~~~~~~~~~~~~~~~~
The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.
Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.
::
pfd.fd = fd;
pfd.events = 0;
if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
error(1, errno, "poll");
ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (ret == -1)
error(1, errno, "recvmsg");
read_notification(msg);
The example is for demonstration purpose only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.
Notification Batching
~~~~~~~~~~~~~~~~~~~~~
Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.
When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.
For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.
Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
Notification Parsing
~~~~~~~~~~~~~~~~~~~~
The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.
The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.
The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, bar for ee_code, as discussed below.
::
struct sock_extended_err *serr;
struct cmsghdr *cm;
cm = CMSG_FIRSTHDR(msg);
if (cm->cmsg_level != SOL_IP &&
cm->cmsg_type != IP_RECVERR)
error(1, 0, "cmsg");
serr = (void *) CMSG_DATA(cm);
if (serr->ee_errno != 0 ||
serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
error(1, 0, "serr");
printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
Deferred copies
~~~~~~~~~~~~~~~
Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.
Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.
In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is not a transmit completion notification, therefore.
Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.
Implementation
==============
Loopback
--------
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbound notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.
Testing
=======
More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.
Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
......@@ -111,6 +111,14 @@ A: Generally speaking, the patches get triaged quickly (in less than 48h).
patch is a good way to ensure your patch is ignored or pushed to
the bottom of the priority list.
Q: I submitted multiple versions of the patch series, should I directly update
patchwork for the previous versions of these patch series?
A: No, please don't interfere with the patch status on patchwork, leave it to
the maintainer to figure out what is the most recent and current version that
should be applied. If there is any doubt, the maintainer will reply and ask
what should be done.
Q: How can I tell what patches are queued up for backporting to the
various stable releases?
......
Hyper-V network driver
======================
Compatibility
=============
This driver is compatible with Windows Server 2012 R2, 2016 and
Windows 10.
Features
========
Checksum offload
----------------
The netvsc driver supports checksum offload as long as the
Hyper-V host version does. Windows Server 2016 and Azure
support checksum offload for TCP and UDP for both IPv4 and
IPv6. Windows Server 2012 only supports checksum offload for TCP.
Receive Side Scaling
--------------------
Hyper-V supports receive side scaling. For TCP, packets are
distributed among available queues based on IP address and port
number.
For UDP, we can switch UDP hash level between L3 and L4 by ethtool
command. UDP over IPv4 and v6 can be set differently. The default
hash level is L4. We currently only allow switching TX hash level
from within the guests.
On Azure, fragmented UDP packets have high loss rate with L4
hashing. Using L3 hashing is recommended in this case.
For example, for UDP over IPv4 on eth0:
To include UDP port numbers in hashing:
ethtool -N eth0 rx-flow-hash udp4 sdfn
To exclude UDP port numbers in hashing:
ethtool -N eth0 rx-flow-hash udp4 sd
To show UDP hash level:
ethtool -n eth0 rx-flow-hash udp4
Generic Receive Offload, aka GRO
--------------------------------
The driver supports GRO and it is enabled by default. GRO coalesces
like packets and significantly reduces CPU usage under heavy Rx
load.
SR-IOV support
--------------
Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
is enabled in both the vSwitch and the guest configuration, then the
Virtual Function (VF) device is passed to the guest as a PCI
device. In this case, both a synthetic (netvsc) and VF device are
visible in the guest OS and both NIC's have the same MAC address.
The VF is enslaved by netvsc device. The netvsc driver will transparently
switch the data path to the VF when it is available and up.
Network state (addresses, firewall, etc) should be applied only to the
netvsc device; the slave device should not be accessed directly in
most cases. The exceptions are if some special queue discipline or
flow direction is desired, these should be applied directly to the
VF slave device.
Receive Buffer
--------------
Packets are received into a receive area which is created when device
is probed. The receive area is broken into MTU sized chunks and each may
contain one or more packets. The number of receive sections may be changed
via ethtool Rx ring parameters.
There is a similar send buffer which is used to aggregate packets for sending.
The send area is broken into chunks of 6144 bytes, each of section may
contain one or more packets. The send buffer is an optimization, the driver
will use slower method to handle very large packets or if the send buffer
area is exhausted.
......@@ -96,17 +96,6 @@ nf_conntrack_max - INTEGER
Size of connection tracking table. Default value is
nf_conntrack_buckets value * 4.
nf_conntrack_default_on - BOOLEAN
0 - don't register conntrack in new net namespaces
1 - register conntrack in new net namespaces (default)
This controls wheter newly created network namespaces have connection
tracking enabled by default. It will be enabled automatically
regardless of this setting if the new net namespace requires
connection tracking, e.g. when NAT rules are created.
This setting is only visible in initial user namespace, it has no
effect on existing namespaces.
nf_conntrack_tcp_be_liberal - BOOLEAN
0 - disabled (default)
not 0 - enabled
......
1. Introduction
rmnet driver is used for supporting the Multiplexing and aggregation
Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
Technologies, Inc. modems.
This driver can be used to register onto any physical network device in
IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
Multiplexing allows for creation of logical netdevices (rmnet devices) to
handle multiple private data networks (PDN) like a default internet, tethering,
multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
routes to the appropriate PDN after removing the MAP header.
Aggregation is required to achieve high data rates. This involves hardware
sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
these MAP frames and send them to appropriate PDN's.
2. Packet format
a. MAP packet (data / control)
MAP header has the same endianness of the IP packet.
Packet format -
Bit 0 1 2-7 8 - 15 16 - 31
Function Command / Data Reserved Pad Multiplexer ID Payload length
Bit 32 - x
Function Raw Bytes
Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
or data packet. Control packet is used for transport level flow control. Data
packets are standard IP packets.
Reserved bits are usually zeroed out and to be ignored by receiver.
Padding is number of bytes to be added for 4 byte alignment if required by
hardware.
Multiplexer ID is to indicate the PDN on which data has to be sent.
Payload length includes the padding length but does not include MAP header
length.
b. MAP packet (command specific)
Bit 0 1 2-7 8 - 15 16 - 31
Function Command Reserved Pad Multiplexer ID Payload length
Bit 32 - 39 40 - 45 46 - 47 48 - 63
Function Command name Reserved Command Type Reserved
Bit 64 - 95
Function Transaction ID
Bit 96 - 127
Function Command data
Command 1 indicates disabling flow while 2 is enabling flow
Command types -
0 for MAP command request
1 is to acknowledge the receipt of a command
2 is for unsupported commands
3 is for error during processing of commands
c. Aggregation
Aggregation is multiple MAP packets (can be data or command) delivered to
rmnet in a single linear skb. rmnet will process the individual
packets and either ACK the MAP command or deliver the IP packet to the
network stack as needed
MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
3. Userspace configuration
rmnet userspace configuration is done through netlink library librmnetctl
and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
The driver uses rtnl_link_ops for communication.
https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
......@@ -818,10 +818,15 @@ The kernel interface functions are as follows:
(*) Send data through a call.
typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
unsigned long user_call_ID,
struct sk_buff *skb);
int rxrpc_kernel_send_data(struct socket *sock,
struct rxrpc_call *call,
struct msghdr *msg,
size_t len);
size_t len,
rxrpc_notify_end_tx_t notify_end_rx);
This is used to supply either the request part of a client call or the
reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the
......@@ -832,6 +837,11 @@ The kernel interface functions are as follows:
The msg must not specify a destination address, control data or any flags
other than MSG_MORE. len is the total amount of data to transmit.
notify_end_rx can be NULL or it can be used to specify a function to be
called when the call changes state to end the Tx phase. This function is
called with the call-state spinlock held to prevent any reply or final ACK
from being delivered first.
(*) Receive data from a call.
int rxrpc_kernel_recv_data(struct socket *sock,
......@@ -965,6 +975,51 @@ The kernel interface functions are as follows:
size should be set when the call is begun. tx_total_len may not be less
than zero.
(*) Check to see the completion state of a call so that the caller can assess
whether it needs to be retried.
enum rxrpc_call_completion {
RXRPC_CALL_SUCCEEDED,
RXRPC_CALL_REMOTELY_ABORTED,
RXRPC_CALL_LOCALLY_ABORTED,
RXRPC_CALL_LOCAL_ERROR,
RXRPC_CALL_NETWORK_ERROR,
};
int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call,
enum rxrpc_call_completion *_compl,
u32 *_abort_code);
On return, -EINPROGRESS will be returned if the call is still ongoing; if
it is finished, *_compl will be set to indicate the manner of completion,
*_abort_code will be set to any abort code that occurred. 0 will be
returned on a successful completion, -ECONNABORTED will be returned if the
client failed due to a remote abort and anything else will return an
appropriate error code.
The caller should look at this information to decide if it's worth
retrying the call.
(*) Retry a client call.
int rxrpc_kernel_retry_call(struct socket *sock,
struct rxrpc_call *call,
struct sockaddr_rxrpc *srx,
struct key *key);
This attempts to partially reinitialise a call and submit it again whilst
reusing the original call's Tx queue to avoid the need to repackage and
re-encrypt the data to be sent. call indicates the call to retry, srx the
new address to send it to and key the encryption key to use for signing or
encrypting the packets.
For this to work, the first Tx data packet must still be in the transmit
queue, and currently this is only permitted for local and network errors
and the call must not have been aborted. Any partially constructed Tx
packet is left as is and can continue being filled afterwards.
It returns 0 if the call was requeued and an error otherwise.
=======================
CONFIGURABLE PARAMETERS
......
Stream Parser
-------------
Stream Parser (strparser)
Introduction
============
The stream parser (strparser) is a utility that parses messages of an
application layer protocol running over a TCP connection. The stream
application layer protocol running over a data stream. The stream
parser works in conjunction with an upper layer in the kernel to provide
kernel support for application layer messages. For instance, Kernel
Connection Multiplexor (KCM) uses the Stream Parser to parse messages
using a BPF program.
The strparser works in one of two modes: receive callback or general
mode.
In receive callback mode, the strparser is called from the data_ready
callback of a TCP socket. Messages are parsed and delivered as they are
received on the socket.
In general mode, a sequence of skbs are fed to strparser from an
outside source. Message are parsed and delivered as the sequence is
processed. This modes allows strparser to be applied to arbitrary
streams of data.
Interface
---------
=========
The API includes a context structure, a set of callbacks, utility
functions, and a data_ready function. The callbacks include
a parse_msg function that is called to perform parsing (e.g.
BPF parsing in case of KCM), and a rcv_msg function that is called
when a full message has been completed.
functions, and a data_ready function for receive callback mode. The
callbacks include a parse_msg function that is called to perform
parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function
that is called when a full message has been completed.
Functions
=========
strp_init(struct strparser *strp, struct sock *sk,
const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
socket associated with the stream parser for use with receive
callback mode; in general mode this is set to NULL. Callbacks
are called by the stream parser (the callbacks are listed below).
void strp_pause(struct strparser *strp)
Temporarily pause a stream parser. Message parsing is suspended
and no new messages are delivered to the upper layer.
void strp_pause(struct strparser *strp)
Unpause a paused stream parser.
void strp_stop(struct strparser *strp);
strp_stop is called to completely stop stream parser operations.
This is called internally when the stream parser encounters an
error, and it is called from the upper layer to stop parsing
operations.
void strp_done(struct strparser *strp);
strp_done is called to release any resources held by the stream
parser instance. This must be called after the stream processor
has been stopped.
A stream parser can be instantiated for a TCP connection. This is done
by:
int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
unsigned int orig_offset, size_t orig_len,
size_t max_msg_size, long timeo)
strp_init(struct strparser *strp, struct sock *csk,
struct strp_callbacks *cb)
strp_process is called in general mode for a stream parser to
parse an sk_buff. The number of bytes processed or a negative
error number is returned. Note that strp_process does not
consume the sk_buff. max_msg_size is maximum size the stream
parser will parse. timeo is timeout for completing a message.
strp is a struct of type strparser that is allocated by the upper layer.
csk is the TCP socket associated with the stream parser. Callbacks are
called by the stream parser.
void strp_data_ready(struct strparser *strp);
The upper layer calls strp_tcp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
from a data_ready callback that is set on the socket. Note that
maximum messages size is the limit of the receive socket
buffer and message timeout is the receive timeout for the socket.
void strp_check_rcv(struct strparser *strp);
strp_check_rcv is called to check for new messages on the socket.
This is normally called at initialization of a stream parser
instance or after strp_unpause.
Callbacks
---------
=========
There are four callbacks:
There are six callbacks:
int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
should parse the sk_buff as containing the headers for the
next application layer messages in the stream.
next application layer message in the stream.
The skb->cb in the input skb is a struct strp_rx_msg. Only
The skb->cb in the input skb is a struct strp_msg. Only
the offset field is relevant in parse_msg and gives the offset
where the message starts in the skb.
......@@ -50,26 +112,41 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
other < 0 : Error is parsing, give control back to userspace
other < 0 : Error in parsing, give control back to userspace
assuming that synchronization is lost and the stream
is unrecoverable (application expected to close TCP socket)
In the case that an error is returned (return value is less than
zero) the stream parser will set the error on TCP socket and wake
it up. If parse_msg returned -ESTRPIPE and the stream parser had
previously read some bytes for the current message, then the error
set on the attached socket is ENODATA since the stream is
unrecoverable in that case.
zero) and the parser is in receive callback mode, then it will set
the error on TCP socket and wake it up. If parse_msg returned
-ESTRPIPE and the stream parser had previously read some bytes for
the current message, then the error set on the attached socket is
ENODATA since the stream is unrecoverable in that case.
void (*lock)(struct strparser *strp)
The lock callback is called to lock the strp structure when
the strparser is performing an asynchronous operation (such as
processing a timeout). In receive callback mode the default
function is to lock_sock for the associated socket. In general
mode the callback must be set appropriately.
void (*unlock)(struct strparser *strp)
The unlock callback is called to release the lock obtained
by the lock callback. In receive callback mode the default
function is release_sock for the associated socket. In general
mode the callback must be set appropriately.
void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
call strp_pause to prevent any further messages from being
received in rcv_msg (see strp_pause below). This callback
received in rcv_msg (see strp_pause above). This callback
must be set.
The skb->cb in the input skb is a struct strp_rx_msg. This
The skb->cb in the input skb is a struct strp_msg. This
struct contains two fields: offset and full_len. Offset is
where the message starts in the skb, and full_len is the
the length of the message. skb->len - offset may be greater
......@@ -78,59 +155,53 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
the TCP socket. The stream parser may read multiple messages
in a loop and this function allows cleanup to occur when existing
the loop. If the callback is not set (NULL in strp_init) a
default function is used.
the TCP socket in receive callback mode. The stream parser may
read multiple messages in a loop and this function allows cleanup
to occur when exiting the loop. If the callback is not set (NULL
in strp_init) a default function is used.
void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
in parsing. The default function stops the stream parser for the
TCP socket and sets the error in the socket. The default function
can be changed by setting the callback to non-NULL in strp_init.
in parsing. The default function stops the stream parser and
sets the error in the socket if the parser is in receive callback
mode. The default function can be changed by setting the callback
to non-NULL in strp_init.
Functions
---------
Statistics
==========
The upper layer calls strp_tcp_data_ready when data is ready on the lower
socket for strparser to process. This should be called from a data_ready
callback that is set on the socket.
Various counters are kept for each stream parser instance. These are in
the strp_stats structure. strp_aggr_stats is a convenience structure for
accumulating statistics for multiple stream parser instances.
save_strp_stats and aggregate_strp_stats are helper functions to save
and aggregate statistics.
strp_stop is called to completely stop stream parser operations. This
is called internally when the stream parser encounters an error, and
it is called from the upper layer when unattaching a TCP socket.
Message assembly limits
=======================
strp_done is called to unattach the stream parser from the TCP socket.
This must be called after the stream processor has be stopped.
The stream parser provide mechanisms to limit the resources consumed by
message assembly.
strp_check_rcv is called to check for new messages on the socket. This
is normally called at initialization of the a stream parser instance
of after strp_unpause.
A timer is set when assembly starts for a new message. In receive
callback mode the message timeout is taken from rcvtime for the
associated TCP socket. In general mode, the timeout is passed as an
argument in strp_process. If the timer fires before assembly completes
the stream parser is aborted and the ETIMEDOUT error is set on the TCP
socket if in receive callback mode.
Statistics
----------
In receive callback mode, message length is limited to the receive
buffer size of the associated TCP socket. If the length returned by
parse_msg is greater than the socket buffer size then the stream parser
is aborted with EMSGSIZE error set on the TCP socket. Note that this
makes the maximum size of receive skbuffs for a socket with a stream
parser to be 2*sk_rcvbuf of the TCP socket.
Various counters are kept for each stream parser for a TCP socket.
These are in the strp_stats structure. strp_aggr_stats is a convenience
structure for accumulating statistics for multiple stream parser
instances. save_strp_stats and aggregate_strp_stats are helper functions
to save and aggregate statistics.
In general mode the message length limit is passed in as an argument
to strp_process.
Message assembly limits
-----------------------
Author
======
The stream parser provide mechanisms to limit the resources consumed by
message assembly.
Tom Herbert (tom@quantonium.net)
A timer is set when assembly starts for a new message. The message
timeout is taken from rcvtime for the associated TCP socket. If the
timer fires before assembly completes the stream parser is aborted
and the ETIMEDOUT error is set on the TCP socket.
Message length is limited to the receive buffer size of the associated
TCP socket. If the length returned by parse_msg is greater than
the socket buffer size then the stream parser is aborted with
EMSGSIZE error set on the TCP socket. Note that this makes the
maximum size of receive skbuffs for a socket with a stream parser
to be 2*sk_rcvbuf of the TCP socket.
......@@ -46,13 +46,13 @@ translate these BPF proglets into native CPU instructions. There are
two flavors of JITs, the newer eBPF JIT currently supported on:
- x86_64
- arm64
- arm32
- ppc64
- sparc64
- mips64
- s390x
And the older cBPF JIT supported on the following archs:
- arm
- mips
- ppc
- sparc
......
......@@ -2492,7 +2492,7 @@ Q: https://patchwork.open-mesh.org/project/batman/list/
S: Maintained
F: Documentation/ABI/testing/sysfs-class-net-batman-adv
F: Documentation/ABI/testing/sysfs-class-net-mesh
F: Documentation/networking/batman-adv.txt
F: Documentation/networking/batman-adv.rst
F: include/uapi/linux/batman_adv.h
F: net/batman-adv/
......@@ -5136,6 +5136,7 @@ F: include/linux/of_net.h
F: include/linux/phy.h
F: include/linux/phy_fixed.h
F: include/linux/platform_data/mdio-gpio.h
F: include/linux/platform_data/mdio-bcm-unimac.h
F: include/trace/events/mdio.h
F: include/uapi/linux/mdio.h
F: include/uapi/linux/mii.h
......@@ -6183,6 +6184,14 @@ S: Maintained
F: drivers/net/ethernet/hisilicon/
F: Documentation/devicetree/bindings/net/hisilicon*.txt
HISILICON NETWORK SUBSYSTEM 3 DRIVER (HNS3)
M: Yisen Zhuang <yisen.zhuang@huawei.com>
M: Salil Mehta <salil.mehta@huawei.com>
L: netdev@vger.kernel.org
W: http://www.hisilicon.com
S: Maintained
F: drivers/net/ethernet/hisilicon/hns3/
HISILICON ROCE DRIVER
M: Lijun Ou <oulijun@huawei.com>
M: Wei Hu(Xavier) <xavier.huwei@huawei.com>
......@@ -6267,6 +6276,13 @@ L: linux-input@vger.kernel.org
S: Maintained
F: drivers/input/touchscreen/htcpen.c
HUAWEI ETHERNET DRIVER
M: Aviad Krawczyk <aviad.krawczyk@huawei.com>
L: netdev@vger.kernel.org
S: Supported
F: Documentation/networking/hinic.txt
F: drivers/net/ethernet/huawei/hinic/
HUGETLB FILESYSTEM
M: Nadia Yvette Chambers <nyc@holomorphy.com>
S: Maintained
......@@ -6293,6 +6309,7 @@ M: Haiyang Zhang <haiyangz@microsoft.com>
M: Stephen Hemminger <sthemmin@microsoft.com>
L: devel@linuxdriverproject.org
S: Maintained
F: Documentation/networking/netvsc.txt
F: arch/x86/include/asm/mshyperv.h
F: arch/x86/include/uapi/asm/hyperv.h
F: arch/x86/kernel/cpu/mshyperv.c
......@@ -6305,6 +6322,7 @@ F: drivers/net/hyperv/
F: drivers/scsi/storvsc_drv.c
F: drivers/uio/uio_hv_generic.c
F: drivers/video/fbdev/hyperv_fb.c
F: net/vmw_vsock/hyperv_transport.c
F: include/linux/hyperv.h
F: tools/hv/
F: Documentation/ABI/stable/sysfs-bus-vmbus
......@@ -7120,9 +7138,7 @@ W: http://irda.sourceforge.net/
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/sameo/irda-2.6.git
F: Documentation/networking/irda.txt
F: drivers/net/irda/
F: include/net/irda/
F: net/irda/
F: drivers/staging/irda/
IRQ DOMAINS (IRQ NUMBER MAPPING LIBRARY)
M: Marc Zyngier <marc.zyngier@arm.com>
......@@ -8449,7 +8465,9 @@ F: include/uapi/linux/uvcvideo.h
MEDIATEK ETHERNET DRIVER
M: Felix Fietkau <nbd@openwrt.org>
M: John Crispin <blogic@openwrt.org>
M: John Crispin <john@phrozen.org>
M: Sean Wang <sean.wang@mediatek.com>
M: Nelson Chang <nelson.chang@mediatek.com>
L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/ethernet/mediatek/
......
......@@ -109,4 +109,6 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _UAPI_ASM_SOCKET_H */
......@@ -50,7 +50,7 @@ config ARM
select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
select HAVE_ARCH_TRACEHOOK
select HAVE_ARM_SMCCC if CPU_V7
select HAVE_CBPF_JIT
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
select HAVE_CC_STACKPROTECTOR
select HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
......
......@@ -50,6 +50,16 @@
device_type = "memory";
reg = <0x60000000 0x40000000>;
};
vcc_phy: vcc-phy-regulator {
compatible = "regulator-fixed";
enable-active-high;
regulator-name = "vcc_phy";
regulator-min-microvolt = <1800000>;
regulator-max-microvolt = <1800000>;
regulator-always-on;
regulator-boot-on;
};
};
&emmc {
......@@ -60,6 +70,30 @@
status = "okay";
};
&gmac {
assigned-clocks = <&cru SCLK_MAC_SRC>;
assigned-clock-rates = <50000000>;
clock_in_out = "output";
phy-supply = <&vcc_phy>;
phy-mode = "rmii";
phy-handle = <&phy>;
status = "okay";
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
#size-cells = <0>;
phy: phy@0 {
compatible = "ethernet-phy-id1234.d400", "ethernet-phy-ieee802.3-c22";
reg = <0>;
clocks = <&cru SCLK_MAC_PHY>;
resets = <&cru SRST_MACPHY>;
phy-is-integrated;
};
};
};
&tsadc {
status = "okay";
......
......@@ -270,6 +270,7 @@ CONFIG_ICPLUS_PHY=y
CONFIG_REALTEK_PHY=y
CONFIG_MICREL_PHY=y
CONFIG_FIXED_PHY=y
CONFIG_ROCKCHIP_PHY=y
CONFIG_USB_PEGASUS=y
CONFIG_USB_RTL8152=m
CONFIG_USB_USBNET=y
......
此差异已折叠。
此差异已折叠。
......@@ -50,6 +50,23 @@
chosen {
stdout-path = "serial2:1500000n8";
};
vcc_phy: vcc-phy-regulator {
compatible = "regulator-fixed";
regulator-name = "vcc_phy";
regulator-always-on;
regulator-boot-on;
};
};
&gmac2phy {
phy-supply = <&vcc_phy>;
clock_in_out = "output";
assigned-clocks = <&cru SCLK_MAC2PHY_SRC>;
assigned-clock-rate = <50000000>;
assigned-clocks = <&cru SCLK_MAC2PHY>;
assigned-clock-parents = <&cru SCLK_MAC2PHY_SRC>;
status = "okay";
};
&uart2 {
......
......@@ -63,6 +63,8 @@
i2c1 = &i2c1;
i2c2 = &i2c2;
i2c3 = &i2c3;
ethernet0 = &gmac2io;
ethernet1 = &gmac2phy;
};
cpus {
......@@ -424,6 +426,43 @@
status = "disabled";
};
gmac2phy: ethernet@ff550000 {
compatible = "rockchip,rk3328-gmac";
reg = <0x0 0xff550000 0x0 0x10000>;
rockchip,grf = <&grf>;
interrupts = <GIC_SPI 21 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "macirq";
clocks = <&cru SCLK_MAC2PHY_SRC>, <&cru SCLK_MAC2PHY_RXTX>,
<&cru SCLK_MAC2PHY_RXTX>, <&cru SCLK_MAC2PHY_REF>,
<&cru ACLK_MAC2PHY>, <&cru PCLK_MAC2PHY>,
<&cru SCLK_MAC2PHY_OUT>;
clock-names = "stmmaceth", "mac_clk_rx",
"mac_clk_tx", "clk_mac_ref",
"aclk_mac", "pclk_mac",
"clk_macphy";
resets = <&cru SRST_GMAC2PHY_A>, <&cru SRST_MACPHY>;
reset-names = "stmmaceth", "mac-phy";
phy-mode = "rmii";
phy-handle = <&phy>;
status = "disabled";
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
#size-cells = <0>;
phy: phy@0 {
compatible = "ethernet-phy-id1234.d400", "ethernet-phy-ieee802.3-c22";
reg = <0>;
clocks = <&cru SCLK_MAC2PHY_OUT>;
resets = <&cru SRST_MACPHY>;
pinctrl-names = "default";
pinctrl-0 = <&fephyled_rxm1 &fephyled_linkm1>;
phy-is-integrated;
};
};
};
gic: interrupt-controller@ff811000 {
compatible = "arm,gic-400";
#interrupt-cells = <3>;
......
......@@ -203,6 +203,7 @@ CONFIG_MARVELL_PHY=m
CONFIG_MESON_GXL_PHY=m
CONFIG_MICREL_PHY=y
CONFIG_REALTEK_PHY=m
CONFIG_ROCKCHIP_PHY=y
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_RTL8152=m
......
......@@ -44,8 +44,12 @@
#define A64_COND_NE AARCH64_INSN_COND_NE /* != */
#define A64_COND_CS AARCH64_INSN_COND_CS /* unsigned >= */
#define A64_COND_HI AARCH64_INSN_COND_HI /* unsigned > */
#define A64_COND_LS AARCH64_INSN_COND_LS /* unsigned <= */
#define A64_COND_CC AARCH64_INSN_COND_CC /* unsigned < */
#define A64_COND_GE AARCH64_INSN_COND_GE /* signed >= */
#define A64_COND_GT AARCH64_INSN_COND_GT /* signed > */
#define A64_COND_LE AARCH64_INSN_COND_LE /* signed <= */
#define A64_COND_LT AARCH64_INSN_COND_LT /* signed < */
#define A64_B_(cond, imm19) A64_COND_BRANCH(cond, (imm19) << 2)
/* Unconditional branch (immediate) */
......
此差异已折叠。
......@@ -102,5 +102,7 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _ASM_SOCKET_H */
......@@ -111,4 +111,6 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _ASM_IA64_SOCKET_H */
......@@ -102,4 +102,6 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _ASM_M32R_SOCKET_H */
......@@ -120,4 +120,6 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _UAPI_ASM_SOCKET_H */
此差异已折叠。
......@@ -102,4 +102,6 @@
#define SO_PEERGROUPS 59
#define SO_ZEROCOPY 60
#endif /* _ASM_SOCKET_H */
......@@ -31,7 +31,6 @@ CONFIG_IP_PNP_BOOTP=y
CONFIG_INET6_IPCOMP=m
CONFIG_IPV6_TUNNEL=m
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NET_PKTGEN=m
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
......
......@@ -101,4 +101,6 @@
#define SO_PEERGROUPS 0x4034
#define SO_ZEROCOPY 0x4035
#endif /* _UAPI_ASM_SOCKET_H */
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
文件已移动
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册
反馈
建议
客服 返回
顶部