Commits · a577223a97df241df26b91a95d03eec8c9fe0b36 · openeuler / Kernel

03 Mar, 2022 1 commit

net: hamradio: fix compliation error · a577223a

Committed by Wang Qing on Mar 01, 2022

Add the missing ")" that was dropped by the previous commit.

Fixes: 61c4fb9c ("net: hamradio: use time_is_after_jiffies() instead of open coding it")
Link: https://lore.kernel.org/all/1646018012-61129-1-git-send-email-wangqing@vivo.com/
Signed-off-by: Wang Qing <wangqing@vivo.com>
Link: https://lore.kernel.org/r/1646203277-83159-1-git-send-email-wangqing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a577223a

02 Mar, 2022 10 commits

Merge branch 'if_ether-h-add-industrial-fieldbus-ethertypes' · 96946d89

Committed by Jakub Kicinski on Mar 01, 2022

Daniel Braunwarth says:

====================
if_ether.h: add industrial fieldbus Ethertypes

This set of patches adds the Ethertypes for PROFINET and EtherCAT.

The defines should be used by iproute2 to extend the list of available link
layer protocols.
====================

Link: https://lore.kernel.org/r/20220228133029.100913-1-daniel@braunwarth.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

96946d89

if_ether.h: add EtherCAT Ethertype · cd73cda7

Committed by Daniel Braunwarth on Feb 28, 2022

Add the Ethertype for the EtherCAT protocol.

Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cd73cda7

if_ether.h: add PROFINET Ethertype · dd0ca255

Committed by Daniel Braunwarth on Feb 28, 2022

Add the Ethertype for the PROFINET protocol.

Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dd0ca255
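
As a rough illustration of what the two new defines are for, the sketch below classifies a raw Ethernet frame by its EtherType field. The values 0x8892 (PROFINET) and 0x88A4 (EtherCAT) are the registered EtherTypes and are assumed here to match the new if_ether.h defines; the helper name and the sample frame are hypothetical and not taken from the patches.

```python
import struct

# Assumed to mirror the new if_ether.h defines (registered EtherType values).
ETH_P_PROFINET = 0x8892
ETH_P_ETHERCAT = 0x88A4

FIELDBUS_NAMES = {ETH_P_PROFINET: "profinet", ETH_P_ETHERCAT: "ethercat"}


def fieldbus_protocol(frame: bytes):
    """Return the fieldbus name for an untagged Ethernet frame, or None."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType in network byte order.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return FIELDBUS_NAMES.get(ethertype)


# Hypothetical frame: broadcast destination, made-up source MAC, EtherCAT type.
sample = bytes.fromhex("ffffffffffff" "020000000001" "88a4") + b"\x00" * 46
```

A user-space tool such as iproute2 would do the reverse mapping as well, turning a protocol name given on the command line into the numeric EtherType.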

macvtap: advertise link netns via netlink · a0219215

Committed by Sven Eckelmann on Feb 28, 2022

Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages. This fixes iproute2 which otherwise resolved
the link interface to an interface in the wrong namespace.

Test commands:

  ip netns add nst
  ip link add dummy0 type dummy
  ip link add link macvtap0 link dummy0 type macvtap
  ip link set macvtap0 netns nst
  ip -netns nst link show macvtap0

Before:

  10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
      link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff

After:

  10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
      link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a0219215

nfp: avoid newline at end of message in NL_SET_ERR_MSG_MOD · 323d51ca

Committed by Wan Jiabing on Mar 01, 2022

Fix the following coccicheck warning:
./drivers/net/ethernet/netronome/nfp/flower/qos_conf.c:750:7-55: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220301112356.1820985-1-wanjiabing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

323d51ca

tun: support NAPI for packets received from batched XDP buffs · fb3f9037

Committed by Harold Huang on Feb 28, 2022

tun already supports NAPI, and NAPI can also be used on the batched XDP
buff path to accelerate packet processing. Once NAPI is in use, GRO is
supported as well. iperf shows that single-stream throughput improves
from 4.5 Gbps to 9.2 Gbps. Additionally, 9.2 Gbps nearly reaches the line
speed of the physical NIC, and about 15% of the CPU core running the vhost
thread remains idle.

Test topology:
[iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]

Iperf stream:
iperf3 -c 10.0.0.2  -i 1 -t 10

Before:
...
[  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
[  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
[  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
[  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
[  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
[  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver

After:
...
[  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
[  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
[  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
[  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
[  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
[  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver

Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fb3f9037

Merge branch 'sfc-optimize-rxqs-count-and-affinities' · 422ce836

Committed by Jakub Kicinski on Mar 01, 2022

Íñigo Huguet says:

====================
sfc: optimize RXQs count and affinities

In the sfc driver, one RX queue per physical core was allocated by default.
Later on, IRQ affinities were set, spreading the IRQs across all NUMA-local
CPUs.

However, that default results in a suboptimal configuration on many modern
systems. Specifically, on systems with hyper-threading and 2 NUMA nodes,
affinities are set so that IRQs are handled by all logical cores of a
single NUMA node. Handling IRQs on both hyper-threading siblings has no
benefit, and setting affinities to one queue per physical core across the
whole system is not a good idea either, because there is a performance
penalty for moving data across nodes (I was able to check it with some XDP
tests using pktgen).

These patches reduce the default number of channels to one per physical
core in the local NUMA node. Then, they set IRQ affinities to CPUs in
the local NUMA node only. This saves hardware resources, since channels
are limited. It also leaves more room for XDP_TX channels without hitting
the driver's limit of 32 channels per interface.

Running performance tests using iperf with a SFC9140 device showed no
performance penalty for reducing the number of channels.

RX XDP tests showed that performance can drop to less than half if
the IRQ is handled by a CPU in a different NUMA node, which doesn't
happen with the new defaults from these patches.
====================

Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

422ce836

sfc: set affinity hints in local NUMA node only · 09a99ab1

Committed by Íñigo Huguet on Feb 28, 2022

Affinity hints were being set to CPUs in the local NUMA node first, and then
to other CPUs. This created 2 unintended issues:
1. Channels created to be assigned each to a different physical core
   were assigned to hyperthreading siblings because they were in the same
   NUMA node.
   Since the previous patch, this no longer happens with the default
   rss_cpus modparam because fewer channels are created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
   decreasing performance too much (to less than half in some of my
   tests).

This patch sets the affinity hints, spreading the channels only across the
local NUMA node's CPUs. A fallback for the case where no CPU in the local
NUMA node is online has been added too.

Example of CPUs being assigned in a suboptimal way before this and the
previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64, but num_present_cpus == 32 so they're
never used):

$ lscpu | grep -i numa
NUMA node(s):                    2
NUMA node0 CPU(s):               0-7,16-23
NUMA node1 CPU(s):               8-15,24-31

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31

CPU assignments after this and the previous patch: normal channels are
created only one per core in the local NUMA node, and affinities are set
only to the local NUMA node:

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

09a99ab1
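
The "after" listing above is consistent with a plain round-robin over the local NUMA node's CPU list. The sketch below reproduces that pattern in user space; it is a simplified model and not the driver's code. The function names are hypothetical, and the fallback list is an assumption, since the commit message does not say which CPUs the fallback uses.

```python
def parse_cpulist(text):
    """Expand a kernel-style CPU list such as "0-7,16-23" into a list of ints."""
    cpus = []
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def spread_channels(n_channels, local_cpus, fallback_cpus=()):
    """Assign each channel a CPU, cycling over the local NUMA node's CPUs.

    fallback_cpus is a hypothetical stand-in for the fallback used when no
    local CPU is online.
    """
    cpus = list(local_cpus) or list(fallback_cpus)
    if not cpus:
        raise ValueError("no CPU available for affinity hints")
    return [cpus[i % len(cpus)] for i in range(n_channels)]


# NUMA node0 from the lscpu output above; 8 normal + 16 XDP channels.
node0 = parse_cpulist("0-7,16-23")
hints = spread_channels(24, node0)
```

With these inputs, channels 0-7 land on CPUs 0-7, xdp-0 to xdp-7 on CPUs 16-23, and xdp-8 to xdp-15 wrap around to CPUs 0-7, as in the listing.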

sfc: default config to 1 channel/core in local NUMA node only · c265b569

Committed by Íñigo Huguet on Feb 28, 2022

Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in the
same NUMA node as the NIC, rather than one per core in the whole system.

Fall back to all other online cores if there are no online CPUs in the local
NUMA node.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c265b569
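
A minimal sketch of the counting rule described in this commit: count physical cores (not logical CPUs) among the NIC's local NUMA node, falling back to other online CPUs when the local node has none. The sibling layout used here (CPU i paired with CPU i+16) is an assumption about the example machine, and the function name is hypothetical; the real driver also applies other limits that are not modelled.

```python
def default_channel_count(local_cpus, siblings_of, fallback_cpus=()):
    """Count distinct physical cores among the candidate CPUs.

    siblings_of maps a logical CPU to the tuple of its hyperthreading
    siblings (itself included); each core is identified by its lowest sibling.
    """
    cpus = list(local_cpus) or list(fallback_cpus)
    cores = {min(siblings_of[cpu]) for cpu in cpus}
    return len(cores)


# Assumed topology: 32 logical CPUs, CPU i and CPU i+16 share a core.
siblings = {cpu: (cpu % 16, cpu % 16 + 16) for cpu in range(32)}
node0_cpus = list(range(0, 8)) + list(range(16, 24))
node1_cpus = list(range(8, 16)) + list(range(24, 32))
```

Under that assumed topology the 16 logical CPUs of node0 collapse to 8 cores, which matches the eight normal channels shown in the listing for the affinity patch.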

net: smc: fix different types in min() · ef739f1d

Committed by Jakub Kicinski on Mar 01, 2022

Fix build:

 include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
   45 | #define min(x, y)       __careful_cmp(x, y, <)
      |                         ^~~~~~~~~~~~~
 net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
  150 |         corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
      |                        ^~~

Fixes: 12bbb0d1 ("net/smc: add sysctl for autocorking")
Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ef739f1d

01 Mar, 2022 24 commits

Merge branch 'smc-datapath-opts' · 7282c126

Committed by David S. Miller on Mar 01, 2022

Dust Li says:

====================
net/smc: some datapath performance optimizations

This series tries to improve SMC datapath performance.

- patch #1 adds a sysctl interface to support tuning the behaviour of
  SMC in container environments.

- patches #2/#3 add autocorking support, which is very efficient for small
  messages without a latency trade-off.

- patch #4 sends directly on setting TCP_NODELAY, without waking up the
  TX worker; this makes it consistent with clearing TCP_CORK.

- patch #5 corrects the setting of the RMB window update limit, so
  we don't send CDC messages to update the peer's RMB window too frequently
  in some cases.

- patch #6 implements something like NAPI in SMC, decreasing the number
  of hardirqs when busy.

- patch #7 moves TX work done in the BH to user context when
  sock_lock is held by the user.

With this patchset applied, we get a good performance gain:
- The qperf tcp_bw test has shown a great improvement. Other benchmarks like
  'netperf TCP_STREAM' or 'sockperf throughput' have similar results.
- In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
  better than TCP at almost all message sizes.

Here are some test results with the following testing command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
		-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf

==== Bandwidth ====
 MsgSize        Origin SMC              TCP                SMC with patches
       1         0.578 MB/s      2.392 MB/s(313.57%)      2.561 MB/s(342.83%)
       2         1.159 MB/s      4.780 MB/s(312.53%)      5.162 MB/s(345.46%)
       4         2.283 MB/s     10.266 MB/s(349.77%)     10.122 MB/s(343.46%)
       8         4.668 MB/s     19.040 MB/s(307.86%)     20.521 MB/s(339.59%)
      16         9.147 MB/s     38.904 MB/s(325.31%)     40.823 MB/s(346.29%)
      32        18.369 MB/s     79.587 MB/s(333.25%)     80.535 MB/s(338.42%)
      64        36.562 MB/s    148.668 MB/s(306.61%)    158.170 MB/s(332.60%)
     128        72.961 MB/s    274.913 MB/s(276.80%)    316.217 MB/s(333.41%)
     256       144.705 MB/s    512.059 MB/s(253.86%)    626.019 MB/s(332.62%)
     512       288.873 MB/s    884.977 MB/s(206.35%)   1221.596 MB/s(322.88%)
    1024       574.180 MB/s   1337.736 MB/s(132.98%)   2203.156 MB/s(283.70%)
    2048      1095.192 MB/s   1865.952 MB/s( 70.38%)   3036.448 MB/s(177.25%)
    4096      2066.157 MB/s   2380.337 MB/s( 15.21%)   3834.271 MB/s( 85.58%)
    8192      3717.198 MB/s   2733.073 MB/s(-26.47%)   4904.910 MB/s( 31.95%)
   16384      4742.221 MB/s   2958.693 MB/s(-37.61%)   5220.272 MB/s( 10.08%)
   32768      5349.550 MB/s   3061.285 MB/s(-42.77%)   5321.865 MB/s( -0.52%)
   65536      5162.919 MB/s   3731.408 MB/s(-27.73%)   5245.021 MB/s(  1.59%)
==== Latency ====
 MsgSize        Origin SMC              TCP                SMC with patches
       1        10.540 us     11.938 us( 13.26%)         10.356 us( -1.75%)
       2        10.996 us     11.992 us(  9.06%)         10.073 us( -8.39%)
       4        10.229 us     11.687 us( 14.25%)          9.996 us( -2.28%)
       8        10.203 us     11.653 us( 14.21%)         10.063 us( -1.37%)
      16        10.530 us     11.313 us(  7.44%)         10.013 us( -4.91%)
      32        10.241 us     11.586 us( 13.13%)         10.081 us( -1.56%)
      64        10.693 us     11.652 us(  8.97%)          9.986 us( -6.61%)
     128        10.597 us     11.579 us(  9.27%)         10.262 us( -3.16%)
     256        10.409 us     11.957 us( 14.87%)         10.148 us( -2.51%)
     512        11.088 us     12.505 us( 12.78%)         10.206 us( -7.95%)
    1024        11.240 us     12.255 us(  9.03%)         10.631 us( -5.42%)
    2048        11.485 us     16.970 us( 47.76%)         10.981 us( -4.39%)
    4096        12.077 us     13.948 us( 15.49%)         11.847 us( -1.90%)
    8192        13.683 us     16.693 us( 22.00%)         13.336 us( -2.54%)
   16384        16.470 us     23.615 us( 43.38%)         16.519 us(  0.30%)
   32768        22.540 us     40.966 us( 81.75%)         22.452 us( -0.39%)
   65536        34.192 us     73.003 us(113.51%)         33.916 us( -0.81%)

------------
Test environment notes:
1. Testing is run on 2 VMs within the same physical host.
2. The NIC is a ConnectX-4Lx, using SRIOV and passing through 2 VFs to the
   2 VMs respectively.
3. To decrease jitter, each VM's vCPUs are bound to physical CPUs, and those
   physical CPUs are all isolated using the boot parameter `isolcpus=xxx`.
4. The queue number is set to 1, and the interrupt from the queue is bound to
   CPU0 in the guest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

7282c126

net/smc: don't send in the BH context if sock_owned_by_user · 6b88af83

Committed by Dust Li on Mar 01, 2022

Sending data all the way down to the RDMA device is a time-consuming
operation (get a new slot, maybe do an RDMA Write and send a CDC, etc.).
Moving those operations from BH to user context is good for performance.

If the sock_lock is held by the user, we don't try to send
data out in the BH context, but just mark that we should
send. Since the user will release the sock_lock soon, we
can do the sending there.

Add smc_release_cb(), which will be called in release_sock()
and tries to send in the callback if needed.

This patch moves the sending part out of BH if the sock lock
is held by the user. In my testing environment, this saves about
20% softirq in the qperf 4K tcp_bw test on the sender side
with no noticeable throughput drop.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b88af83
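
The deferral pattern described in this commit can be modelled in a few lines. The class below is a user-space sketch only: all names are hypothetical, `release()` stands in for release_sock() invoking a callback such as smc_release_cb(), and real locking and concurrency are deliberately left out.

```python
class SketchSock:
    """Toy model of 'mark in BH, send when the lock owner releases'."""

    def __init__(self):
        self.owned_by_user = False  # stand-in for sock_owned_by_user()
        self.tx_pending = False     # "we should send" marker set from BH
        self.queue = []             # data waiting to be pushed out
        self.sent = []              # data handed to the (imaginary) device

    def _push_pending(self):
        self.sent.extend(self.queue)
        self.queue.clear()

    def bh_tx_event(self):
        """Called from bottom-half context when data could be sent."""
        if self.owned_by_user:
            self.tx_pending = True  # only mark; the lock owner sends later
            return
        self._push_pending()

    def lock(self):
        self.owned_by_user = True

    def release(self):
        """Release the lock and run the deferred send if one was marked."""
        if self.tx_pending:
            self.tx_pending = False
            self._push_pending()
        self.owned_by_user = False


sock = SketchSock()
sock.queue.append(b"payload")
sock.lock()
sock.bh_tx_event()   # deferred: the user currently owns the socket
sock.release()       # the deferred send happens here, in user context
```

The point of the design is that the expensive send path runs in the context that already holds the lock, so the bottom half does only cheap bookkeeping.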

net/smc: don't req_notify until all CQEs drained · a505cce6

Committed by Dust Li on Mar 01, 2022

When we are handling softirq workload, enabling hardirq may
interrupt the current softirq routine again, and then
try to raise the softirq again. This only wastes CPU cycles
and brings no real gain.

Since IB_CQ_REPORT_MISSED_EVENTS already makes sure that if
ib_req_notify_cq() returns 0, it is safe to wait for the
next event, there is no need to poll the CQ again in this case.

This patch disables hardirq during the processing of softirq,
and re-arms the CQ after softirq is done, somewhat like NAPI.

Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a505cce6
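
The control flow described here (drain first, re-arm last, poll again only if the re-arm reports possibly missed completions) can be sketched with a fake completion queue. This is a simplified model and not the kernel code: `FakeCQ`, `drain`, and the budget value are hypothetical, and `req_notify()` only imitates the return convention the commit message attributes to ib_req_notify_cq() with IB_CQ_REPORT_MISSED_EVENTS.

```python
from collections import deque


class FakeCQ:
    """Toy completion queue with a poll / re-arm interface."""

    def __init__(self, entries=()):
        self.entries = deque(entries)
        self.arm_calls = 0

    def poll(self, budget):
        """Remove and return up to `budget` completions."""
        n = min(budget, len(self.entries))
        return [self.entries.popleft() for _ in range(n)]

    def req_notify(self):
        """Re-arm; return >0 if completions may have been missed, else 0."""
        self.arm_calls += 1
        return 1 if self.entries else 0


def drain(cq, handle, budget=8):
    """Process completions; re-arm only once the queue looks empty."""
    while True:
        batch = cq.poll(budget)
        for cqe in batch:
            handle(cqe)
        if len(batch) == budget:
            continue  # queue may still hold entries: keep polling, no re-arm
        if cq.req_notify() == 0:
            return    # armed with nothing missed: safe to wait for next event


cq = FakeCQ(range(20))
handled = []
drain(cq, handled.append)
```

In this run the queue is polled three times (8, 8, and 4 entries) and re-armed once, instead of being re-armed while work is still pending.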

net/smc: correct settings of RMB window update limit · 6bf536eb

Committed by Dust Li on Mar 01, 2022

rmbe_update_limit is used to avoid announcing receive
window updates too frequently. RFC7609 requests a minimal
increase in the window size of 10% of the receive buffer
space. But the current implementation used:

  min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)

and SOCK_MIN_SNDBUF / 2 == 2304 bytes, which is almost
always less than 10% of the receive buffer space.

This causes the receiver to always send a CDC message to
update its consumer cursor when it consumes more than 2K
of data. As a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K messages.

This patch fixes this by using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).

With this patch and SMC autocorking enabled, the qperf 2K/4K/8K
tcp_bw tests show 45%/75%/40% increases in throughput respectively.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bf536eb
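
To make the effect of swapping min for max concrete, the sketch below evaluates both formulas. The constant 2304 comes from the commit message; the function names and the buffer sizes tried are illustrative assumptions, and integer division stands in for the C arithmetic.

```python
# SOCK_MIN_SNDBUF / 2, as stated in the commit message.
SOCK_MIN_SNDBUF_HALF = 2304


def update_limit_old(rmbe_size):
    """Limit before the patch: capped at 2304 bytes."""
    return min(rmbe_size // 10, SOCK_MIN_SNDBUF_HALF)


def update_limit_new(rmbe_size):
    """Limit after the patch: at least 10% of the receive buffer."""
    return max(rmbe_size // 10, SOCK_MIN_SNDBUF_HALF)


# Hypothetical receive buffer sizes, in bytes.
for size in (16384, 65536, 262144):
    print(size, update_limit_old(size), update_limit_new(size))
```

For a 65536-byte buffer the old formula yields 2304 bytes while the new one yields 6553, so window updates are announced far less often for the same amount of consumed data.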

net/smc: send directly on setting TCP_NODELAY · b70a5cc0

Committed by Dust Li on Mar 01, 2022

In commit ea785a1a ("net/smc: Send directly when
TCP_CORK is cleared"), we stopped using delayed work
to implement cork.

This patch uses the same algorithm: it removes the
delayed work when setting TCP_NODELAY and sends
directly in setsockopt(). This also makes
TCP_NODELAY behave the same as in TCP.

Cc: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b70a5cc0

net/smc: add sysctl for autocorking · 12bbb0d1

Committed by Dust Li on Mar 01, 2022

This adds a new sysctl: net.smc.autocorking_size

We can dynamically change the behaviour of autocorking
by changing the value of autocorking_size.
Setting it to 0 disables autocorking in SMC.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12bbb0d1

net/smc: add autocorking support · dcd2cf5f

Committed by Dust Li on Mar 01, 2022

This patch adds autocorking support for SMC, which can improve
throughput for small messages by 3x or more.

The main idea is borrowed from TCP autocorking, with some RDMA-specific
modifications:
1. The first message should never be corked, to make sure we don't
   add extra latency.
2. If we have posted any Tx WRs to the NIC that have not
   completed, cork the new messages until:
   a) we receive the CQE for the last Tx WR, or
   b) we have corked enough messages on the connection.
3. Try to push the corked data out when we receive the CQE of
   the last Tx WR, to prevent the corked messages from hanging in
   the send queue.

Both SMC autocorking and TCP autocorking check the TX completion
to decide whether we should cork or not. The difference is that
when we get an SMC Tx WR completion, the data has been confirmed
by the RNIC, while a TCP TX completion just tells us the data
has been sent out by the local NIC.

Add an atomic variable tx_pushing in smc_connection to make
sure only one sender can push at a time, letting it cork more and
save CDC slots.

SMC autocorking should not add extra latency, since the first
message will always be sent out immediately.

The qperf tcp_bw test shows more than a 4x increase for small
message sizes with a Mellanox ConnectX-4Lx, with the same result in other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows that SMC autocorking does not increase
ping-pong latency.

Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf

==== Bandwidth ====
MsgSize(Bytes)   SMC-NoCork             TCP                SMC-AutoCorking
       1         0.578 MB/s      2.392 MB/s(313.57%)      2.647 MB/s(357.72%)
       2         1.159 MB/s      4.780 MB/s(312.53%)      5.153 MB/s(344.71%)
       4         2.283 MB/s     10.266 MB/s(349.77%)     10.363 MB/s(354.02%)
       8         4.668 MB/s     19.040 MB/s(307.86%)     21.215 MB/s(354.45%)
      16         9.147 MB/s     38.904 MB/s(325.31%)     41.740 MB/s(356.32%)
      32        18.369 MB/s     79.587 MB/s(333.25%)     82.392 MB/s(348.52%)
      64        36.562 MB/s    148.668 MB/s(306.61%)    161.564 MB/s(341.89%)
     128        72.961 MB/s    274.913 MB/s(276.80%)    325.363 MB/s(345.94%)
     256       144.705 MB/s    512.059 MB/s(253.86%)    633.743 MB/s(337.96%)
     512       288.873 MB/s    884.977 MB/s(206.35%)   1250.681 MB/s(332.95%)
    1024       574.180 MB/s   1337.736 MB/s(132.98%)   2246.121 MB/s(291.19%)
    2048      1095.192 MB/s   1865.952 MB/s( 70.38%)   2057.767 MB/s( 87.89%)
    4096      2066.157 MB/s   2380.337 MB/s( 15.21%)   2173.983 MB/s(  5.22%)
    8192      3717.198 MB/s   2733.073 MB/s(-26.47%)   3491.223 MB/s( -6.08%)
   16384      4742.221 MB/s   2958.693 MB/s(-37.61%)   4637.692 MB/s( -2.20%)
   32768      5349.550 MB/s   3061.285 MB/s(-42.77%)   5385.796 MB/s(  0.68%)
   65536      5162.919 MB/s   3731.408 MB/s(-27.73%)   5223.890 MB/s(  1.18%)
==== Latency ====
MsgSize(Bytes)   SMC-NoCork             TCP                SMC-AutoCorking
       1        10.540 us     11.938 us( 13.26%)         10.573 us(  0.31%)
       2        10.996 us     11.992 us(  9.06%)         10.269 us( -6.61%)
       4        10.229 us     11.687 us( 14.25%)         10.240 us(  0.11%)
       8        10.203 us     11.653 us( 14.21%)         10.402 us(  1.95%)
      16        10.530 us     11.313 us(  7.44%)         10.599 us(  0.66%)
      32        10.241 us     11.586 us( 13.13%)         10.223 us( -0.18%)
      64        10.693 us     11.652 us(  8.97%)         10.251 us( -4.13%)
     128        10.597 us     11.579 us(  9.27%)         10.494 us( -0.97%)
     256        10.409 us     11.957 us( 14.87%)         10.710 us(  2.89%)
     512        11.088 us     12.505 us( 12.78%)         10.547 us( -4.88%)
    1024        11.240 us     12.255 us(  9.03%)         10.787 us( -4.03%)
    2048        11.485 us     16.970 us( 47.76%)         11.256 us( -1.99%)
    4096        12.077 us     13.948 us( 15.49%)         12.230 us(  1.27%)
    8192        13.683 us     16.693 us( 22.00%)         13.786 us(  0.75%)
   16384        16.470 us     23.615 us( 43.38%)         16.459 us( -0.07%)
   32768        22.540 us     40.966 us( 81.75%)         23.284 us(  3.30%)
   65536        34.192 us     73.003 us(113.51%)         34.233 us(  0.12%)

With SMC autocorking support, we can achieve better throughput
than TCP at most message sizes without any latency trade-off.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dcd2cf5f
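
The corking rules listed in this commit reduce to a small decision function. The sketch below is a simplified reading of those rules, not the kernel implementation: the function and parameter names are hypothetical, and "corked enough" is assumed to mean reaching the autocorking_size threshold introduced by the sysctl commit, with 0 disabling corking.

```python
def should_autocork(tx_wrs_in_flight, corked_bytes, autocorking_size):
    """Decide whether a new message should be held back (corked).

    tx_wrs_in_flight: Tx work requests posted but not yet completed.
    corked_bytes:     bytes already corked on this connection.
    autocorking_size: threshold in bytes; 0 disables autocorking.
    """
    if autocorking_size == 0:
        return False  # autocorking disabled
    if tx_wrs_in_flight == 0:
        return False  # nothing pending: send at once, no added latency
    # Something is in flight: keep corking until enough has accumulated.
    return corked_bytes < autocorking_size


# Hypothetical values: one WR in flight, 4 KiB corked, 64 KiB threshold.
decision = should_autocork(1, 4096, 65536)
```

Rule 3 (pushing corked data when the CQE of the last Tx WR arrives) would live in the completion handler, which is why this function never needs a timer.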

net/smc: add sysctl interface for SMC · 462791bb

Committed by Dust Li on Mar 01, 2022

This patch adds a sysctl interface to support container environments
for SMC, as discussed on the mailing list.

Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

462791bb

Merge branch 'vxlan-vnifiltering' · 1e385c08

Committed by David S. Miller on Mar 01, 2022

Roopa Prabhu says:

====================
vxlan metadata device vnifiltering support

This series adds vnifiltering support to vxlan collect metadata device.

Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.

This series adds vnifiltering support to collect metadata devices to
terminate only configured vnis. This is similar to vlan filtering in
bridge driver. The vni filtering capability is provided by a new flag on
collect metadata device.

In the below pic:
	- customer1 is mapped to br1 bridge domain
	- customer2 is mapped to br2 bridge domain
	- customer1 vlan 10-11 is mapped to vni 1001-1002
	- customer2 vlan 10-11 is mapped to vni 2001-2002
	- br1 and br2 are vlan filtering bridges
	- vxlan1 and vxlan2 are collect metadata devices with
	  vnifiltering enabled

┌──────────────────────────────────────────────────────────────────┐
│  switch                                                          │
│                                                                  │
│         ┌───────────┐                 ┌───────────┐              │
│         │           │                 │           │              │
│         │   br1     │                 │   br2     │              │
│         └┬─────────┬┘                 └──┬───────┬┘              │
│     vlans│         │               vlans │       │               │
│     10,11│         │                10,11│       │               │
│          │     vlanvnimap:               │    vlanvnimap:        │
│          │       10-1001,11-1002         │      10-2001,11-2002  │
│          │         │                     │       │               │
│   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
│   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
│   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
│   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
│       │         └────────────┘           │      │  2001,2002   │ │
│       │                                  │      └──────────────┘ │
│       │                                  │                       │
└───────┼──────────────────────────────────┼───────────────────────┘
        │                                  │
        │                                  │
  ┌─────┴───────┐                          │
  │  customer1  │                    ┌─────┴──────┐
  │ host/VM     │                    │customer2   │
  └─────────────┘                    │ host/VM    │
                                     └────────────┘

v2:
  - remove stale xstats declarations pointed out by Nikolay Aleksandrov
  - squash selinux patch with the tunnel api patch as pointed out by
    Benjamin Poirier
  - Fix various build issues:
    Reported-by: kernel test robot <lkp@intel.com>

v3:
  - incorporate review feedback from Jakub
	- move rhashtable declarations to c file
	- define and use netlink policy for top level vxlan filter api
	- fix unused stats function warning
	- pass vninode from vnifilter lookup into stats count function
		to avoid another lookup (only applicable to vxlan_rcv)
	- fix missing vxlan vni delete notifications in vnifilter uninit
	  function
	- misc cleanups
  - remote dev check for multicast groups added via vnifiltering api
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

1e385c08
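
The diagram in this cover letter implies a simple lookup: a received VNI is terminated only by the collect metadata device whose filter contains it, and is then mapped to a VLAN in that device's bridge domain. The sketch below models that lookup with the values from the diagram; the data structures and function names are hypothetical and say nothing about how the driver stores its filter.

```python
# Values taken from the diagram: per-device VNI filters and VLAN-to-VNI maps.
VNIFILTERS = {
    "vxlan1": {1001, 1002},
    "vxlan2": {2001, 2002},
}
VLAN_VNI_MAP = {
    "vxlan1": {10: 1001, 11: 1002},
    "vxlan2": {10: 2001, 11: 2002},
}


def terminating_device(vni):
    """Return the device whose vnifilter contains this VNI, or None."""
    for dev, vnis in VNIFILTERS.items():
        if vni in vnis:
            return dev
    return None


def classify(vni):
    """Return (device, vlan) for a received VNI, or None if it is filtered out."""
    dev = terminating_device(vni)
    if dev is None:
        return None
    for vlan, mapped_vni in VLAN_VNI_MAP[dev].items():
        if mapped_vni == vni:
            return dev, vlan
    return None
```

Both customers reuse VLANs 10 and 11, yet `classify(1001)` and `classify(2001)` resolve to different devices, which is what lets the two bridge domains share one VXLAN UDP port.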

drivers: vxlan: vnifilter: add support for stats dumping · 445b2f36

Committed by Nikolay Aleksandrov on Mar 01, 2022

Add support for dumping the stats of VXLAN vni filter entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

445b2f36

drivers: vxlan: vnifilter: per vni stats · 4095e0e1

Committed by Nikolay Aleksandrov on Mar 01, 2022

Add per-vni statistics for vni filter mode, counting Rx/Tx
bytes/packets/drops/errors at the appropriate places.

This patch changes vxlan_vs_find_vni to also return the
vxlan_vni_node in cases where the vni belongs to a vni
filtering vxlan device.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4095e0e1

selftests: add new tests for vxlan vnifiltering · 3edf5f66