提交 · 42611b70f8be12c46906c869f5d819ac70dfd1c7 · openeuler / Kernel

10 8月, 2019 14 次提交

net: hns3: add check for max TX BD num for tso and non-tso case · 42611b70

由 Yunsheng Lin 提交于 8月 09, 2019

Hardware supports up to 8 TX BD for non-TSO skb and 63 TX
BD for TSO skb. Currently hns3 driver does not check the max
BD num that required by a skb before filling desc, which may
cause the hardware to issue a RAS error throug PCIe AER.

This patch adds the max BD num check before filling desc,
if the bd num is not within the hardware limit, it will
record the error by ring->stats.sw_err_cnt counter and
free the skb.

This patch also cleans up the hns3_nic_bd_num function by
changing the return type and removing an unnecessary check.
Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42611b70

net: hns3: add some statitics info to tx process · b20d7fe5

由 Yunsheng Lin 提交于 8月 09, 2019

This patch adds tx_vlan_err, tx_l4_proto_err, tx_l2l3l4_err
and tx_tso_err counter to tx process, in order to better
debug the desc filling error.

This patch also adds a missing u64_stats_update_* around
ring->stats.sw_err_cnt and adds hns3_rl_err to limit the
error printing in the IO patch.
Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b20d7fe5

net: hns3: add DFX registers information for ethtool -d · ddb54554

由 Guangbin Huang 提交于 8月 09, 2019

Now we can use ethtool -d command to dump some registers. However,
these registers information is not enough to find out where the problem is.

This patch adds DFX registers information after original registers
when use ethtool -d commmand to dump registers. Also, using macro
replaces some related magic number.
Signed-off-by: NGuangbin Huang <huangguangbin2@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ddb54554

net: hns3: modify how pause options is displayed · aacbe27e

由 Yonglong Liu 提交于 8月 09, 2019

Currently, the pause options of HNS3 shown like this:
"RX/TX" is always the same with "RX negotiated/TX negotiated".
Because of the driver covered the value of "RX/TX" with the value
of "RX negotiated/TX negotiated" after adjust link.

This patch records the pause configurations of the user, and never
covered them in adjust link.
Signed-off-by: NYonglong Liu <liuyonglong@huawei.com>
Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aacbe27e

net: hns3: add input length check for debugfs write function · 7ac243f9

由 Yufeng Mo 提交于 8月 09, 2019

If the input length reaches the maximum value of size_t, the reverse is
triggered when 1 is added. In addition, there is no need to have such a
large length. Therefore, the input length should be checked and the value
should be less than or equal to 1024.
Signed-off-by: NYufeng Mo <moyufeng@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ac243f9

net: hns3: clean up for vlan handling in hns3_fill_desc_vtags · eb977d99

由 Yunsheng Lin 提交于 8月 09, 2019

This patch refactors the hns3_fill_desc_vtags function
by avoiding passing too many parameters, reducing indent
level and some other clean up.

This patch also adds the hns3_fill_skb_desc function to
fill the first desc of a skb.
Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb977d99

net: hns3: fix interrupt clearing error for VF · 13050921

由 Huazhong Tan 提交于 8月 09, 2019

Currently, VF driver has two kinds of interrupts, reset & CMDQ RX.
For revision 0x21, according to the UM, each interrupt should be
cleared by write 0 to the corresponding bit, but the implementation
writes 0 to the whole register in fact, it will clear other
interrupt at the same time, then the VF will loss the interrupt.
But for revision 0x20, this interrupt clear register is a read &
write register, for compatible, we just keep the old implementation
for 0x20.

This patch fixes it, also, adds a new register for reading the interrupt
status according to hardware user manual.

Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Fixes: b90fcc5b ("net: hns3: add reset handling for VF when doing Core/Global/IMP reset")
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13050921

net: hns3: fix GFP flag error in hclge_mac_update_stats() · 9e6717af

由 Zhongzhu Liu 提交于 8月 09, 2019

When CONFIG_DEBUG_ATOMIC_SLEEP on, calling kzalloc with
GFP_KERNEL in hclge_mac_update_stats() will get below warning:

[   52.514677] BUG: sleeping function called from invalid context at mm/slab.h:501
[   52.522051] in_atomic(): 0, irqs_disabled(): 0, pid: 1015, name: ifconfig
[   52.528827] 2 locks held by ifconfig/1015:
[   52.532921]  #0: (____ptrval____) (&p->lock){....}, at: seq_read+0x54/0x748
[   52.539878]  #1: (____ptrval____) (rcu_read_lock){....}, at: dev_seq_start+0x0/0x140
[   52.547610] CPU: 16 PID: 1015 Comm: ifconfig Not tainted 5.3.0-rc3-00697-g20b80be #98
[   52.555408] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B050.01 08/08/2019
[   52.564242] Call trace:
[   52.566687]  dump_backtrace+0x0/0x1f8
[   52.570338]  show_stack+0x14/0x20
[   52.573646]  dump_stack+0xb4/0xec
[   52.576950]  ___might_sleep+0x178/0x198
[   52.580773]  __might_sleep+0x74/0xe0
[   52.584338]  __kmalloc+0x244/0x2d8
[   52.587744]  hclge_mac_update_stats+0xc8/0x1f8 [hclge]
[   52.592870]  hclge_update_stats+0xe0/0x170 [hclge]
[   52.597651]  hns3_nic_get_stats64+0xa0/0x458 [hns3]
[   52.602514]  dev_get_stats+0x58/0x138
[   52.606165]  dev_seq_printf_stats+0x8c/0x280
[   52.610420]  dev_seq_show+0x14/0x40
[   52.613898]  seq_read+0x574/0x748
[   52.617205]  proc_reg_read+0xb4/0x108
[   52.620857]  __vfs_read+0x54/0xa8
[   52.624162]  vfs_read+0xa0/0x190
[   52.627380]  ksys_read+0xc8/0x178
[   52.630685]  __arm64_sys_read+0x40/0x50
[   52.634509]  el0_svc_common.constprop.0+0x120/0x1e0
[   52.639369]  el0_svc_handler+0x50/0x90
[   52.643106]  el0_svc+0x8/0xc

So this patch uses GFP_ATOMIC instead of GFP_KERNEL to fix it.

Fixes: d174ea75 ("net: hns3: add statistics for PFC frames and MAC control frames")
Signed-off-by: NZhongzhu Liu <liuzhongzhu@huawei.com>
Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: NPeng Li <lipeng321@huawei.com>
Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9e6717af

taprio: remove unused variable 'entry_list_policy' · ca497fb6

由 YueHaibing 提交于 8月 09, 2019

net/sched/sch_taprio.c:680:32: warning:
 entry_list_policy defined but not used [-Wunused-const-variable=]

One of the points of commit a3d43c0d ("taprio: Add support adding
an admin schedule") is that it removes support (it now returns "not
supported") for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY
attribute (which were never used), the parsing of those types of schedules
was the only user of this policy. So removing this policy should be fine.
Reported-by: NHulk Robot <hulkci@huawei.com>
Suggested-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca497fb6

r8169: fix performance issue on RTL8168evl · a7eb6a4f

由 Holger Hoffstätte 提交于 8月 09, 2019

Disabling TSO but leaving SG active results is a significant
performance drop. Therefore disable also SG on RTL8168evl.
This restores the original performance.

Fixes: 93681cd7 ("r8169: enable HW csum and TSO")
Signed-off-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a7eb6a4f

tcp: Update TCP_BASE_MSS comment · 1555e6fd

由 Josh Hunt 提交于 8月 07, 2019

TCP_BASE_MSS is used as the default initial MSS value when MTU probing is
enabled. Update the comment to reflect this.
Suggested-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NJosh Hunt <johunt@akamai.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1555e6fd

tcp: add new tcp_mtu_probe_floor sysctl · c04b79b6

由 Josh Hunt 提交于 8月 07, 2019

The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.
Signed-off-by: NJosh Hunt <johunt@akamai.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c04b79b6

devlink: remove pointless data_len arg from region snapshot create · 3a5e5234

由 Jiri Pirko 提交于 8月 09, 2019

The size of the snapshot has to be the same as the size of the region,
therefore no need to pass it again during snapshot creation. Remove the
arg and use region->size instead.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a5e5234

tcp: batch calls to sk_flush_backlog() · 1a991488

由 Eric Dumazet 提交于 8月 09, 2019

Starting from commit d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
loopback flows got hurt, because for each skb sent, the socket receives an
immediate ACK and sk_flush_backlog() causes extra work.

Intent was to not let the backlog grow too much, but we went a bit too far.

We can check the backlog every 16 skbs (about 1MB chunks)
to increase TCP over loopback performance by about 15 %

Note that the call to sk_flush_backlog() handles a single ACK,
thanks to coalescing done on backlog, but cleans the 16 skbs
found in rtx rb-tree.
Reported-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a991488

09 8月, 2019 26 次提交

liquidio: Use pcie_flr() instead of reimplementing it · fcc32a21

由 Denis Efremov 提交于 8月 08, 2019

octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR
bit, which bypasses timing requirements imposed by the PCIe spec.
This patch fixes the function to use the pcie_flr() interface instead.
Signed-off-by: NDenis Efremov <efremov@linux.com>
Reviewed-by: NAndrew Murray <andrew.murray@arm.com>
Reviewed-by: NBjorn Helgaas <bhelgaas@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fcc32a21

r8169: allocate rx buffers using alloc_pages_node · 32879f00

由 Heiner Kallweit 提交于 8月 07, 2019

We allocate 16kb per rx buffer, so we can avoid some overhead by using
alloc_pages_node directly instead of bothering kmalloc_node. Due to
this change buffers are page-aligned now, therefore the alignment check
can be removed.
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Acked-by: NHayes Wang <hayeswang@realtek.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

32879f00

fq_codel: remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count' · 018e5b45

由 YueHaibing 提交于 8月 07, 2019

Fixes gcc '-Wunused-but-set-variable' warning:

net/sched/sch_fq_codel.c: In function fq_codel_dequeue:
net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable]
net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable]

They are not used since commit 77ddaff2 ("fq_codel: Kill
useless per-flow dropped statistic")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

018e5b45

mlxsw: spectrum: Extend to support Spectrum-3 ASIC · da382875

由 Jiri Pirko 提交于 8月 07, 2019

Extend existing driver for Spectrum and Spectrum-2 ASICs
to support Spectrum-3 ASIC as well.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NPetr Machata <petrm@mellanox.com>
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

da382875

Merge branch 'stmmac-next' · eb716a64

由 David S. Miller 提交于 8月 08, 2019

Jose Abreu says:

====================
net: stmmac: Improvements for -next

[ This is just a rebase of v2 into latest -next in order to avoid a merge
conflict ]

Couple of improvements for -next tree. More info in commit logs.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb716a64

net: stmmac: selftests: Add a selftest for Flexible RX Parser · ccfc639a

由 Jose Abreu 提交于 8月 07, 2019

Add a selftest for the Flexible RX Parser feature.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccfc639a

net: stmmac: Add Flexible RX Parser support in XGMAC · d6e1c12c

由 Jose Abreu 提交于 8月 07, 2019

XGMAC cores also support the Flexible RX Parser feature. Add the support
for it in the XGMAC core.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6e1c12c

net: stmmac: Implement Safety Features in XGMAC core · 56e58d6c

由 Jose Abreu 提交于 8月 07, 2019

XGMAC also supports Safety Features. This patch implements the
configuration and handling of this feature in XGMAC core.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56e58d6c

net: stmmac: selftests: Add test for VLAN and Double VLAN Filtering · 74043f6b

由 Jose Abreu 提交于 8月 07, 2019

Add a selftest for VLAN and Double VLAN Filtering in stmmac.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

74043f6b

net: stmmac: Implement VLAN Hash Filtering in XGMAC · 3cd1cfcb

由 Jose Abreu 提交于 8月 07, 2019

Implement the VLAN Hash Filtering feature in XGMAC core.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3cd1cfcb

net: stmmac: selftests: Add RSS test · 1fbdad00

由 Jose Abreu 提交于 8月 07, 2019

Add a test for RSS in the stmmac selftests.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1fbdad00

net: stmmac: Implement RSS and enable it in XGMAC core · 76067459

由 Jose Abreu 提交于 8月 07, 2019

Implement the RSS functionality and add the corresponding callbacks in
XGMAC core.

Changes from v1:
	- Do not use magic constants (Jakub)
	- Use ethtool_rxfh_indir_default() (Jakub)
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

76067459

net: stmmac: xgmac: Implement tx_queue_prio() · 7035aad8

由 Jose Abreu 提交于 8月 07, 2019

Implement the TX Queue Priority callback in XGMAC core.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7035aad8

net: stmmac: xgmac: Implement set_mtl_tx_queue_weight() · 5656ac55

由 Jose Abreu 提交于 8月 07, 2019

Implement the TX Queue Weight callback. In order for this to be active
we also need to set ETS algorithm when configuring Queue.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5656ac55

net: stmmac: xgmac: Implement MMC counters · b6cdf09f

由 Jose Abreu 提交于 8月 07, 2019

Implement the MMC counters feature in XGMAC core.
Signed-off-by: NJose Abreu <joabreu@synopsys.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b6cdf09f

tipc: add loopback device tracking · 6c9081a3

由 John Rutherford 提交于 8月 07, 2019

Since node internal messages are passed directly to the socket, it is not
possible to observe those messages via tcpdump or wireshark.

We now remedy this by making it possible to clone such messages and send
the clones to the loopback interface.  The clones are dropped at reception
and have no functional role except making the traffic visible.

The feature is enabled if network taps are active for the loopback device.
pcap filtering restrictions require the messages to be presented to the
receiving side of the loopback device.

v3 - Function dev_nit_active used to check for network taps.
   - Procedure netif_rx_ni used to send cloned messages to loopback device.
Signed-off-by: NJohn Rutherford <john.rutherford@dektech.com.au>
Acked-by: NJon Maloy <jon.maloy@ericsson.com>
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c9081a3

Merge branch 'flow_offload-add-indr-block-in-nf_table_offload' · 2339ef1c

由 David S. Miller 提交于 8月 08, 2019

wenxu says:

====================
flow_offload: add indr-block in nf_table_offload

This series patch make nftables offload support the vlan and
tunnel device offload through indr-block architecture.

The first four patches mv tc indr block to flow offload and
rename to flow-indr-block.
Because the new flow-indr-block can't get the tcf_block
directly. The fifth patch provide a callback list to get
flow_block of each subsystem immediately when the device
register and contain a block.
The last patch make nf_tables_offload support flow-indr-block.

This version add a mutex lock for add/del flow_indr_block_ing_cb
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2339ef1c

netfilter: nf_tables_offload: support indr block call · 9a32669f

由 wenxu 提交于 8月 07, 2019

nftable support indr-block call. It makes nftable an offload vlan
and tunnel device.

nft add table netdev firewall
nft add chain netdev firewall aclout { type filter hook ingress offload device mlx_pf0vf0 priority - 300 \; }
nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0
nft add chain netdev firewall aclin { type filter hook ingress device vlan0 priority - 300 \; }
nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9a32669f

flow_offload: support get multi-subsystem block · 1150ab0f

由 wenxu 提交于 8月 07, 2019

It provide a callback list to find the blocks of tc
and nft subsystems
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1150ab0f

flow_offload: move tc indirect block to flow offload · 4e481908

由 wenxu 提交于 8月 07, 2019

move tc indirect block to flow_offload and rename
it to flow indirect block.The nf_tables can use the
indr block architecture.
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e481908

cls_api: add flow_indr_block_call function · e4da9102

由 wenxu 提交于 8月 07, 2019

This patch make indr_block_call don't access struct tc_indr_block_cb
and tc_indr_block_dev directly
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e4da9102

cls_api: remove the tcf_block cache · f8436988

由 wenxu 提交于 8月 07, 2019

Remove the tcf_block in the tc_indr_block_dev for muti-subsystem
support.
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f8436988

cls_api: modify the tc_indr_block_ing_cmd parameters. · 242453c2

由 wenxu 提交于 8月 07, 2019

This patch make tc_indr_block_ing_cmd can't access struct
tc_indr_block_dev and tc_indr_block_cb.
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

242453c2

Merge branch 'net-batched-receive-in-GRO-path' · 61552d2c

由 David S. Miller 提交于 8月 08, 2019

Edward Cree says:

====================
net: batched receive in GRO path

This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct, which is received at the end of the napi
 poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.

Performance figures with this series, collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs with 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next.  But note *)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next:   2.851 ms
after #1:   2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3:   2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
 each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
 significant.  But I think we can roughly conclude that the series
 marginally improves non-GROable throughput, without hurting latency
 (except in the large-payload busy-polling case, which in any case yields
 horrid performance even on net-next (almost triple the latency without
 busy-poll).  Also, drivers which, unlike sfc, pass UDP traffic to GRO
 would expect to see a benefit from gaining access to batching.

Changed in v3:
 * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
 * removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets
   after each cycle of the napi busy loop.  See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61552d2c

net: use listified RX for handling GRO_NORMAL skbs · 323ebb61

由 Edward Cree 提交于 8月 06, 2019

When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
 of passing it to the stack immediately, place it on a list in the napi
 struct.  Then, at flush time (napi_complete_done(), napi_poll(), or
 napi_busy_loop()), call netif_receive_skb_list_internal() on the list.
We'd like to do that in napi_gro_flush(), but it's not called if
 !napi->gro_bitmask, so we have to do it in the callers instead.  (There are
 a handful of drivers that call napi_gro_flush() themselves, but it's not
 clear why, or whether this will affect them.)
Because a full 64 packets is an inefficiently large batch, also consume the
 list whenever it exceeds gro_normal_batch, a new net/core sysctl that
 defaults to 8.
Signed-off-by: NEdward Cree <ecree@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

323ebb61

sfc: falcon: don't score irq moderation points for GRO · 67270136

由 Edward Cree 提交于 8月 06, 2019

Same rationale as for sfc, except that this wasn't performance-tested.
Signed-off-by: NEdward Cree <ecree@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

67270136

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功