1. 20 Feb 2023, 1 commit
    • net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link() · e40b801b
      By D. Wythe
      There is a certain chance to trigger the following panic:
      
      PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
       #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
       #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
       #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
       #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
       #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
       #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
       #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
          [exception RIP: ib_alloc_mr+19]
          RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
          RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
          R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
          R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
       #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
       #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
      
      The reason is that when the server tries to create a second link,
      smc_llc_srv_add_link() runs without protection and may add a new link
      to the link group. This breaks the critical section guarded by
      llc_conf_mutex.
      
      Fixes: 2d2209f2 ("net/smc: first part of add link processing as SMC server")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Nov 2022, 1 commit
  3. 27 Sep 2022, 1 commit
    • net/smc: Support SO_REUSEPORT · 6627a207
      By Tony Lu
      This enables SO_REUSEPORT [1] on the clcsock when it is set on the smc
      socket, so that applications which use it can be transparently replaced
      with SMC. This also helps improve load distribution.
      
      Here is a simple test of NGINX + wrk with SMC. The CPU usage is collected
      on NGINX (server) side as below.
      
      Disable SO_REUSEPORT:
      
      05:15:33 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      05:15:34 PM  all    7.02    0.00   11.86    0.00    2.04    8.93    0.00    0.00    0.00   70.15
      05:15:34 PM    0    0.00    0.00    0.00    0.00   16.00   70.00    0.00    0.00    0.00   14.00
      05:15:34 PM    1   11.58    0.00   22.11    0.00    0.00    0.00    0.00    0.00    0.00   66.32
      05:15:34 PM    2    1.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
      05:15:34 PM    3   16.84    0.00   30.53    0.00    0.00    0.00    0.00    0.00    0.00   52.63
      05:15:34 PM    4   28.72    0.00   44.68    0.00    0.00    0.00    0.00    0.00    0.00   26.60
      05:15:34 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      05:15:34 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      05:15:34 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      
      Enable SO_REUSEPORT:
      
      05:15:20 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      05:15:21 PM  all    8.56    0.00   14.40    0.00    2.20    9.86    0.00    0.00    0.00   64.98
      05:15:21 PM    0    0.00    0.00    4.08    0.00   14.29   76.53    0.00    0.00    0.00    5.10
      05:15:21 PM    1    9.09    0.00   16.16    0.00    1.01    0.00    0.00    0.00    0.00   73.74
      05:15:21 PM    2    9.38    0.00   16.67    0.00    1.04    0.00    0.00    0.00    0.00   72.92
      05:15:21 PM    3   10.42    0.00   17.71    0.00    1.04    0.00    0.00    0.00    0.00   70.83
      05:15:21 PM    4    9.57    0.00   15.96    0.00    0.00    0.00    0.00    0.00    0.00   74.47
      05:15:21 PM    5    9.18    0.00   15.31    0.00    0.00    1.02    0.00    0.00    0.00   74.49
      05:15:21 PM    6    8.60    0.00   15.05    0.00    0.00    0.00    0.00    0.00    0.00   76.34
      05:15:21 PM    7   12.37    0.00   14.43    0.00    0.00    0.00    0.00    0.00    0.00   73.20
      
      Using SO_REUSEPORT helps the load distribution of NGINX be more
      balanced.
      
      [1] https://man7.org/linux/man-pages/man7/socket.7.html
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220922121906.72406-1-tonylu@linux.alibaba.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. 22 Sep 2022, 1 commit
    • net/smc: Unbind r/w buffer size from clcsock and make them tunable · 0227f058
      By Tony Lu
      Currently, SMC uses smc->sk.sk_{rcv|snd}buf to size the send buffer
      and the RMB, and those values are inherited from tcp_{w|r}mem via the
      clcsock.
      
      Buffer sizes inherited from the TCP socket do not fit SMC well. SMC-R/-D
      buffers usually need to be larger than TCP's to reach higher performance,
      since they run over different underlying devices and paths.
      
      So this patch unbinds the buffer sizes from TCP and introduces two
      sysctl knobs to tune them independently. These knobs are per net
      namespace, so they also work for containers.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  5. 01 Sep 2022, 1 commit
  6. 27 Jul 2022, 1 commit
  7. 18 Jul 2022, 2 commits
    • net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      By Wen Gu
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may
      trigger frequent memory compaction, which causes unexpected hangs and
      further stability risks.
      
      So this patch allows an SMC-R link group to use virtually contiguous
      sndbufs and RMBs to avoid the potential issues mentioned above.
      Whether physically or virtually contiguous buffers are used can be set
      by the sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in the data path, caused by the additional address
         translation of the sndbuf by the RNIC in Tx. But in general,
         translating addresses through the MTT is fast.
      
         Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
         latency and bandwidth test with physically and virtually contiguous
         buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in the buffer initialization and destruction path, caused
         by additional MR operations on sndbufs. But thanks to the link group
         buffer reuse mechanism, the impact of this regression decreases as
         buffer reuse increases.
      
         Taking 256KB sndbuf and RMB as an example, the latency of some key
         SMC-R buffer-related functions obtained by bpftrace is as follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. The above tests run on 2 VMs within the same host.
      2. The NIC is ConnectX-4 Lx, using SR-IOV and passing 2 VFs through to
         each VM respectively.
      3. The VMs' vCPUs are bound to different physical CPUs, and the bound
         physical CPUs are isolated by the `isolcpus=xxx` cmdline.
      4. The NICs' queue numbers are set to 1.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove redundant dma sync ops · 6d52e2de
      By Guangguan Wang
      smc_ib_sync_sg_for_cpu/device are the ops used for DMA memory cache
      consistency. SMC sndbufs are DMA buffers to which the CPU writes data
      and from which the PCIe device reads it. So for sndbufs,
      smc_ib_sync_sg_for_device is needed, while smc_ib_sync_sg_for_cpu is
      redundant, as the PCIe device never writes these buffers. SMC RMBs are
      DMA buffers to which the PCIe device writes data and from which the
      CPU reads it. So for RMBs, smc_ib_sync_sg_for_cpu is needed, while
      smc_ib_sync_sg_for_device is redundant, as the CPU never writes them.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 26 May 2022, 1 commit
  9. 25 May 2022, 1 commit
  10. 23 May 2022, 2 commits
    • net/smc: fix listen processing for SMC-Rv2 · 8c3b8dc5
      By liuyacan
      In the process of checking whether RDMAv2 is available, the current
      implementation first sets ini->smcrv2.ib_dev_v2 and then allocates the
      smc buf desc, but the latter may fail. Unfortunately, the caller only
      checks the former. In this case, a NULL pointer dereference occurs in
      smc_clc_send_confirm_accept() when accessing conn->rmb_desc.
      
      This patch does two things:
      1. Use the return code to determine whether V2 is available.
      2. If the return code is NODEV, continue to check whether V1 is
      available.
      
      Fixes: e49300a6 ("net/smc: add listen processing for SMC-Rv2")
      Signed-off-by: liuyacan <liuyacan@corp.netease.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: postpone sk_refcnt increment in connect() · 75c1edf2
      By liuyacan
      Same trigger condition as commit 86434744: a setsockopt() runs in
      parallel to a connect() and switches the socket into fallback mode.
      The sk_refcnt is then incremented in smc_connect(), but the state
      stays in SMC_INIT (not SMC_ACTIVE), so the corresponding sk_refcnt
      decrement in __smc_release() is never performed.
      
      Fixes: 86434744 ("net/smc: add fallback check to connect()")
      Signed-off-by: liuyacan <liuyacan@corp.netease.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 May 2022, 1 commit
  12. 26 Apr 2022, 2 commits
  13. 25 Apr 2022, 1 commit
  14. 15 Apr 2022, 1 commit
  15. 12 Apr 2022, 1 commit
  16. 07 Mar 2022, 1 commit
    • net/smc: fix compile warning for smc_sysctl · 7de8eb0d
      By Dust Li
      kernel test robot reports multiple warnings for smc_sysctl:
      
        In file included from net/smc/smc_sysctl.c:17:
      >> net/smc/smc_sysctl.h:23:5: warning: no previous prototype \
      	for function 'smc_sysctl_init' [-Wmissing-prototypes]
        int smc_sysctl_init(void)
             ^
      and
        >> WARNING: modpost: vmlinux.o(.text+0x12ced2d): Section mismatch \
        in reference from the function smc_sysctl_exit() to the variable
        .init.data:smc_sysctl_ops
        The function smc_sysctl_exit() references
        the variable __initdata smc_sysctl_ops.
        This is often because smc_sysctl_exit lacks a __initdata
        annotation or the annotation of smc_sysctl_ops is wrong.
      
      and
        net/smc/smc_sysctl.c: In function 'smc_sysctl_init_net':
        net/smc/smc_sysctl.c:47:17: error: 'struct netns_smc' has no member named 'smc_hdr'
           47 |         net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
      
      Since we don't need global sysctl initialization, remove the global
      pernet_operations and smc_sysctl_{init|exit} to keep things clean and
      simple, and call smc_sysctl_net_{init|exit} directly from
      smc_net_{init|exit}.
      
      Also initialize sysctl_autocorking_size when CONFIG_SYSCTL is not set,
      which makes sure SMC autocorking stays enabled by default in that
      configuration.
      
      Fixes: 462791bb ("net/smc: add sysctl interface for SMC")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 01 Mar 2022, 3 commits
  18. 28 Feb 2022, 1 commit
  19. 25 Feb 2022, 1 commit
    • net/smc: fix connection leak · 9f1c50cf
      By D. Wythe
      There's a potential leak issue under the following execution sequence:
      
      smc_release  				smc_connect_work
      if (sk->sk_state == SMC_INIT)
					send_clc_confirm
      	tcp_abort();
      					...
      					sk.sk_state = SMC_ACTIVE
      smc_close_active
      switch(sk->sk_state) {
      ...
      case SMC_ACTIVE:
      	smc_close_final()
      	// then wait peer closed
      
      Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are
      still in the tcp send buffer, in which case our connection token cannot
      be delivered to the server side, which means that we cannot get a
      passive close message at all. Therefore, it is impossible for the
      connection to be fully disconnected.
      
      This patch takes a very simple approach to avoid this issue: once the
      state has changed to SMC_ACTIVE after tcp_abort(), we actively abort
      the smc connection. Considering that the state was SMC_INIT before
      tcp_abort(), abandoning the complete disconnection process should not
      cause much of a problem.
      
      In fact, this problem may exist as long as the CLC CONFIRM message is
      not received by the server. Whether a timer should be added after
      smc_close_final() needs to be discussed in the future. But even so,
      this patch provides a faster release for the connection in the above
      case, so it should still be valuable.
      
      Fixes: 39f41f36 ("net/smc: common release code for non-accepted sockets")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Feb 2022, 1 commit
  21. 17 Feb 2022, 1 commit
  22. 11 Feb 2022, 6 commits
    • net/smc: Add global configure for handshake limitation by netlink · f9496b7c
      By D. Wythe
      Although we can control the SMC handshake limitation through socket
      options, applications that need it must modify their code, which is
      quite troublesome for many existing applications. This patch allows
      the global default value of the SMC handshake limitation to be
      modified through netlink, providing a way to put a constraint on
      handshakes without modifying any application code.
      Suggested-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Dynamic control handshake limitation by socket options · a6a6fe27
      By D. Wythe
      This patch adds dynamic control of the SMC handshake limitation for
      every smc socket. In production environments, the same application may
      handle different service types and may have different opinions on the
      SMC handshake limitation.
      
      This patch uses socket options to accomplish that. Since we don't have
      a socket option level for SMC yet, it is implemented at the same time.
      
      This patch does the following:
      
      - add new socket option level: SOL_SMC.
      - add new SMC socket option: SMC_LIMIT_HS.
      - provide getter/setter for SMC socket options.
      
      Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit SMC visits when handshake workqueue congested · 48b6190a
      By D. Wythe
      This patch provides a mechanism to constrain incoming SMC connections
      according to the pressure on the SMC handshake process. At present, a
      high connection rate causes incoming connections to be backlogged in
      the SMC handshake queue, raising connection establishment time, which
      is quite unacceptable for applications based on short-lived
      connections.
      
      There are two ways to implement this mechanism:
      
      1. Put the limitation after TCP is established.
      2. Put the limitation before TCP is established.
      
      In the first way, we need to wait for and receive the CLC messages the
      client will potentially send, and then actively reply with a decline
      message. In a sense this is also a sort of SMC handshake, and it
      affects connection establishment time on its way.
      
      In the second way, the only problem is that we need to inject SMC
      logic into TCP when it is about to reply to the incoming SYN. Since we
      already do that, it no longer seems to be a problem, and the advantage
      is obvious: few additional steps are required to enforce the
      constraint.
      
      This patch uses the second way. After this patch, connections beyond
      the constraint will not receive any SMC indication, and SMC will not
      be involved in any of their subsequent processing.
      
      Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit backlog connections · 8270d9c2
      By D. Wythe
      The current implementation does not handle backlog semantics; one
      potential risk is that the server can be flooded by an unbounded
      number of connections, even from SMC-incapable clients.
      
      This patch puts a limit on backlog connections. Referring to the TCP
      implementation, we divide SMC connections into two categories:
      
      1. Half SMC connections: TCP is established but SMC is not yet
      established.
      
      2. Full SMC connections: SMC is established.
      
      For half SMC connections, since every half SMC connection starts with
      TCP establishment, we can achieve our goal by applying a limit before
      TCP is established. Following the TCP implementation, this limit is
      based not only on the half SMC connections but also on the full
      connections, so it also constrains full SMC connections.
      
      For full SMC connections, although we know exactly where they start,
      it's quite hard to put a limit before that point. The easiest way is
      to block and wait before receiving the SMC confirm CLC message, but
      that happens under the protection of smc_server_lgr_pending, a global
      lock, which would make the limit apply to the entire host instead of a
      single listen socket. Another way is to drop the full connections, but
      considering the cost of establishing SMC connections, we prefer to
      keep them.
      
      Even so, limits on full SMC connections still exist; see the commits
      about half SMC connections above.
      
      After this patch, the limits on backlog connections look like:
      
      For SMC:
      
      1. A client with SMC capability can make at most 2 * backlog full SMC
         connections, or 1 * backlog half SMC connections plus 1 * backlog
         full SMC connections.
      
      2. A client without SMC capability can only make 1 * backlog half TCP
         connections plus 1 * backlog full TCP connections.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Make smc_tcp_listen_work() independent · 3079e342
      By D. Wythe
      In a multithreaded, 10K-connection benchmark, the backend TCP
      connections are established very slowly, and lots of TCP connections
      stay in SYN_SENT state.
      
      Client: smc_run wrk -c 10000 -t 4 http://server
      
      netstat on the server host shows:
          145042 times the listen queue of a socket overflowed
          145042 SYNs to LISTEN sockets dropped
      
      One reason for this issue is that smc_tcp_listen_work() shares the
      same workqueue (smc_hs_wq) with smc_listen_work(), while
      smc_listen_work() blocks waiting for the smc connection to be
      established. Once the workqueue becomes congested, it blocks accept()
      on the TCP listen socket.
      
      This patch creates an independent workqueue (smc_tcp_ls_wq) for
      smc_tcp_listen_work(), separating it from smc_listen_work(), which is
      quite acceptable considering that smc_tcp_listen_work() runs very
      fast.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Avoid overwriting the copies of clcsock callback functions · 1de9770d
      By Wen Gu
      The callback functions of the clcsock are saved and replaced during
      the fallback. But if the fallback happens more than once, the copies
      of these callback functions are overwritten incorrectly, resulting in
      a call loop:
      
      clcsk->sk_error_report
       |- smc_fback_error_report() <------------------------------|
           |- smc_fback_forward_wakeup()                          | (loop)
               |- clcsock_callback()  (incorrectly overwritten)   |
                   |- smc->clcsk_error_report() ------------------|
      
      So this patch fixes the issue by saving these function pointers only
      once during the fallback and avoiding overwriting them.
      
      Reported-by: syzbot+4de3c0e8a263e1e499bc@syzkaller.appspotmail.com
      Fixes: 341adeec ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
      Link: https://lore.kernel.org/r/0000000000006d045e05d78776f6@google.com
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 31 Jan 2022, 3 commits
    • net/smc: Cork when sendpage with MSG_SENDPAGE_NOTLAST flag · be9a16cc
      By Tony Lu
      This introduces handling of the corked flag MSG_SENDPAGE_NOTLAST,
      which is involved in the sendfile() syscall [1] and indicates that
      this is not the last page. So we can cork the data until a page no
      longer carries this flag. It has the same effect as MSG_MORE, but
      exists in sendfile() only.
      
      This patch handles MSG_SENDPAGE_NOTLAST for corking data, trying to
      cork more data before sending when sendfile() is used, which matches
      TCP's behaviour. Also, it reimplements the default sendpage to
      indicate that it is supported to some extent.
      
      [1] https://man7.org/linux/man-pages/man2/sendfile.2.html
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Send directly when TCP_CORK is cleared · ea785a1a
      By Tony Lu
      According to the man page of TCP_CORK [1], if set, don't send out
      partial frames. All queued partial frames are sent when the option is
      cleared again.
      
      When an application calls setsockopt() to disable TCP_CORK, the call
      is protected by lock_sock() and tries to mod_delayed_work() to 0 in
      order to send pending data right away. However, the delayed work
      smc_tx_work is also protected by lock_sock(). This introduces lock
      contention for sending data.
      
      To fix it, send pending data directly, as TCP does, in the context of
      setsockopt() (which already holds lock_sock()), and cancel the now
      unnecessary delayed work, which is protected by the lock.
      
      [1] https://linux.die.net/man/7/tcp
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Forward wakeup to smc socket waitqueue after fallback · 341adeec
      By Wen Gu
      When we replace TCP with SMC and a fallback occurs, there may be
      some socket waitqueue entries remaining in smc socket->wq, such
      as eppoll_entries inserted by userspace applications.
      
      After the fallback, data flows over TCP/IP and only clcsock->wq will
      be woken up. Applications can't be notified by the entries that were
      inserted in smc socket->wq before the fallback. So we need a mechanism
      to wake up smc socket->wq at the same time if entries remain in it.
      
      The current workaround is to transfer the entries from smc socket->wq
      to clcsock->wq during the fallback. But this may cause a crash
      like this:
      
       general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E     5.16.0+ #107
       RIP: 0010:__wake_up_common+0x65/0x170
       Call Trace:
        <IRQ>
        __wake_up_common_lock+0x7a/0xc0
        sock_def_readable+0x3c/0x70
        tcp_data_queue+0x4a7/0xc40
        tcp_rcv_established+0x32f/0x660
        ? sk_filter_trim_cap+0xcb/0x2e0
        tcp_v4_do_rcv+0x10b/0x260
        tcp_v4_rcv+0xd2a/0xde0
        ip_protocol_deliver_rcu+0x3b/0x1d0
        ip_local_deliver_finish+0x54/0x60
        ip_local_deliver+0x6a/0x110
        ? tcp_v4_early_demux+0xa2/0x140
        ? tcp_v4_early_demux+0x10d/0x140
        ip_sublist_rcv_finish+0x49/0x60
        ip_sublist_rcv+0x19d/0x230
        ip_list_rcv+0x13e/0x170
        __netif_receive_skb_list_core+0x1c2/0x240
        netif_receive_skb_list_internal+0x1e6/0x320
        napi_complete_done+0x11d/0x190
        mlx5e_napi_poll+0x163/0x6b0 [mlx5_core]
        __napi_poll+0x3c/0x1b0
        net_rx_action+0x27c/0x300
        __do_softirq+0x114/0x2d2
        irq_exit_rcu+0xb4/0xe0
        common_interrupt+0xba/0xe0
        </IRQ>
        <TASK>
      
      The crash is caused by privately transferring waitqueue entries from
      smc socket->wq to clcsock->wq. The owners of these entries, such as
      epoll, have no idea that the entries have been transferred to a
      different socket wait queue and still use original waitqueue spinlock
      (smc socket->wq.wait.lock) to make the entries operation exclusive,
      but it doesn't work. The operations to the entries, such as removing
      from the waitqueue (now is clcsock->wq after fallback), may cause a
      crash when clcsock waitqueue is being iterated over at the moment.
      
      This patch tries to fix this by no longer transferring wait queue
      entries privately, but introducing own implementations of clcsock's
      callback functions in fallback situation. The callback functions will
      forward the wakeup to smc socket->wq if clcsock->wq is actually woken
      up and smc socket->wq has remaining entries.
      
      Fixes: 2153bd1e ("net/smc: Transfer remaining wait queue entries during fallback")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 24 Jan 2022, 1 commit
    • net/smc: Transitional solution for clcsock race issue · c0bf3d8a
      By Wen Gu
      We encountered a crash in smc_setsockopt() and it is caused by
      accessing smc->clcsock after clcsock was released.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E     5.16.0-rc4+ #53
       RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f16ba83918e
        </TASK>
      
      This patch fixes the issue by holding clcsock_release_lock and
      checking whether clcsock has already been released before accessing
      it.
      
      In case a crash of the same kind happens in smc_getsockopt() or
      smc_switch_to_fallback(), this patch also checks smc->clcsock in
      those functions. And the caller of smc_switch_to_fallback() will
      identify whether fallback succeeded according to the return value.
      
      Fixes: fd57770d ("net/smc: wait for pending work before clcsock release_sock")
      Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 13 Jan 2022, 1 commit
  26. 06 Jan 2022, 1 commit
      net/smc: Reset conn->lgr when link group registration fails · 36595d8a
      Wen Gu authored
      SMC connections might fail to be registered in a link group because
      no usable link can be found during its creation. As a result,
      smc_conn_create() will return a failure and most resources related
      to the connection won't be applied or initialized, such as
      conn->abort_work or conn->lnk.
      
      If smc_conn_free() is invoked later, it will try to access the
      uninitialized resources related to the connection, thus causing
      a warning or crash.
      
      This patch tries to fix this by resetting conn->lgr to NULL if an
      abnormal exit occurs in smc_lgr_register_conn(), thus avoiding
      access to uninitialized resources in smc_conn_free().
      
      Meanwhile, the newly created link group should be terminated if smc
      connections can't be registered in it. So smc_lgr_cleanup_early() is
      modified to take care of the link group only and is invoked by
      smc_conn_create() to terminate an unusable link group. The call to
      smc_conn_free() is moved out of smc_lgr_cleanup_early() into
      smc_conn_abort().
      
      Fixes: 56bc3b20 ("net/smc: assign link to a new connection")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 02 Jan 2022, 1 commit
      net/smc: Introduce TCP ULP support · d7cd421d
      Tony Lu authored
      This implements a TCP ULP for SMC, which helps applications replace
      TCP with the SMC protocol in place, and we use it to implement
      transparent replacement.
      
      It replaces the original TCP socket with SMC, reusing TCP as the
      clcsock, when setsockopt() is called with the TCP_ULP option, without
      any overhead.
      
      To replace TCP sockets with SMC, there are two approaches:
      
      - use the setsockopt() syscall with the TCP_ULP option; on error, it
        falls back to TCP.
      
      - use a BPF prog of type BPF_CGROUP_INET_SOCK_CREATE or similar to
        replace sockets transparently. BPF hooks some points in socket
        creation, bind and others; users can inject their BPF logic without
        modifying their applications, and choose which connections should
        be replaced with SMC by calling setsockopt() in the BPF prog, based
        on rules such as TCP tuples, PID, cgroup, etc.
      
        BPF doesn't support calling setsockopt() with TCP_ULP yet; I will
        send the patches after this is accepted.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 17 Dec 2021, 1 commit
      net/smc: Prevent smc_release() from long blocking · 5c15b312
      D. Wythe authored
      In the nginx/wrk benchmark, there is a hang problem with high
      probability in a case like the following (the client takes several
      minutes to exit):
      
      server: smc_run nginx
      
      client: smc_run wrk -c 10000 -t 1 http://server
      
      Client hangs with the following backtrace:
      
      0 [ffffa7ce80f3bbf8] __schedule at ffffffff9f9e0d5f
      1 [ffffa7ce80f3bc88] schedule at ffffffff9f9e10e6
      2 [ffffa7ce80f3bca0] schedule_timeout at ffffffff9f9e3f3c
      3 [ffffa7ce80f3bd20] wait_for_common at ffffffff9f9e19de
      4 [ffffa7ce80f3bd80] __flush_work at ffffffff9f0fe013
      5 [ffffa7ce80f3bdf0] smc_release at ffffffffc0697d24 [smc]
      6 [ffffa7ce80f3be20] __sock_release at ffffffff9f802e2d
      7 [ffffa7ce80f3be40] sock_close at ffffffff9f802eb1
      8 [ffffa7ce80f3be48] __fput at ffffffff9f334f93
      9 [ffffa7ce80f3be78] task_work_run at ffffffff9f101ff5
      10 [ffffa7ce80f3bea0] do_exit at ffffffff9f0e5012
      11 [ffffa7ce80f3bf10] do_group_exit at ffffffff9f0e592a
      12 [ffffa7ce80f3bf38] __x64_sys_exit_group at ffffffff9f0e5994
      13 [ffffa7ce80f3bf40] do_syscall_64 at ffffffff9f9d4373
      14 [ffffa7ce80f3bf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c
      
      This issue is due to flush_work(), which is used in smc_release() to
      wait for smc_connect_work() to finish. Once lots of smc_connect_work()
      items are pending, or all executing work is dangling, smc_release()
      has to block until a worker becomes free, which is equivalent to
      waiting for another smc_connect_work() to finish.
      
      In order to fix this, there are two changes:
      
      1. For idle smc_connect_work(), cancel it from the workqueue; for
         executing smc_connect_work(), wait for it to finish. For that
         purpose, replace flush_work() with cancel_work_sync().
      
      2. Since smc_connect() holds a reference for passive closing, release
         the reference if smc_connect_work() has been cancelled.
      
      Fixes: 24ac3a08 ("net/smc: rebuild nonblocking connect")
      Reported-by: Tony Lu <tonylu@linux.alibaba.com>
      Tested-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>