1. 20 Feb 2023, 1 commit
    • net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link() · e40b801b
      By D. Wythe
      There is a certain chance to trigger the following panic:
      
      PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
       #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
       #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
       #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
       #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
       #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
       #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
       #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
          [exception RIP: ib_alloc_mr+19]
          RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
          RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
          R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
          R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
       #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
       #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
      
      The reason is that when the server tries to create a second link,
      smc_llc_srv_add_link() runs without protection and may add a new link
      to the link group. This breaks the critical section guarded by
      llc_conf_mutex.
      
      Fixes: 2d2209f2 ("net/smc: first part of add link processing as SMC server")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Nov 2022, 1 commit
  3. 27 Sep 2022, 1 commit
    • net/smc: Support SO_REUSEPORT · 6627a207
      By Tony Lu
      This enables SO_REUSEPORT [1] on the clcsock when it is set on the smc
      socket, so that applications which use it can be transparently replaced
      with SMC. This also helps improve load distribution.
      
      Here is a simple test of NGINX + wrk with SMC. The CPU usage is collected
      on NGINX (server) side as below.
      
      Disable SO_REUSEPORT:
      
      05:15:33 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      05:15:34 PM  all    7.02    0.00   11.86    0.00    2.04    8.93    0.00    0.00    0.00   70.15
      05:15:34 PM    0    0.00    0.00    0.00    0.00   16.00   70.00    0.00    0.00    0.00   14.00
      05:15:34 PM    1   11.58    0.00   22.11    0.00    0.00    0.00    0.00    0.00    0.00   66.32
      05:15:34 PM    2    1.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
      05:15:34 PM    3   16.84    0.00   30.53    0.00    0.00    0.00    0.00    0.00    0.00   52.63
      05:15:34 PM    4   28.72    0.00   44.68    0.00    0.00    0.00    0.00    0.00    0.00   26.60
      05:15:34 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      05:15:34 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      05:15:34 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
      
      Enable SO_REUSEPORT:
      
      05:15:20 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      05:15:21 PM  all    8.56    0.00   14.40    0.00    2.20    9.86    0.00    0.00    0.00   64.98
      05:15:21 PM    0    0.00    0.00    4.08    0.00   14.29   76.53    0.00    0.00    0.00    5.10
      05:15:21 PM    1    9.09    0.00   16.16    0.00    1.01    0.00    0.00    0.00    0.00   73.74
      05:15:21 PM    2    9.38    0.00   16.67    0.00    1.04    0.00    0.00    0.00    0.00   72.92
      05:15:21 PM    3   10.42    0.00   17.71    0.00    1.04    0.00    0.00    0.00    0.00   70.83
      05:15:21 PM    4    9.57    0.00   15.96    0.00    0.00    0.00    0.00    0.00    0.00   74.47
      05:15:21 PM    5    9.18    0.00   15.31    0.00    0.00    1.02    0.00    0.00    0.00   74.49
      05:15:21 PM    6    8.60    0.00   15.05    0.00    0.00    0.00    0.00    0.00    0.00   76.34
      05:15:21 PM    7   12.37    0.00   14.43    0.00    0.00    0.00    0.00    0.00    0.00   73.20
      
      Using SO_REUSEPORT helps the load distribution of NGINX be more
      balanced.
      
      [1] https://man7.org/linux/man-pages/man7/socket.7.html
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220922121906.72406-1-tonylu@linux.alibaba.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. 22 Sep 2022, 1 commit
    • net/smc: Unbind r/w buffer size from clcsock and make them tunable · 0227f058
      By Tony Lu
      Currently, SMC uses smc->sk.sk_{rcv|snd}buf to size the send buffer
      and the RMB, and those values are inherited from tcp_{w|r}mem via the
      clcsock.
      
      Buffer sizes inherited from the TCP socket do not fit SMC well. SMC-R/-D
      buffers usually need to be larger than TCP's to reach higher performance,
      since they run over different underlying devices and paths.
      
      So this patch unbinds the buffer sizes from TCP and introduces two
      sysctl knobs to tune them independently. These knobs are per net
      namespace, so they also work for containers.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  5. 01 Sep 2022, 1 commit
  6. 27 Jul 2022, 1 commit
  7. 18 Jul 2022, 2 commits
    • net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      By Wen Gu
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may
      trigger frequent memory compaction, which causes unexpected hangs and
      further stability risks.
      
      So this patch allows an SMC-R link group to use virtually contiguous
      sndbufs and RMBs to avoid the potential issues mentioned above.
      Whether physically or virtually contiguous buffers are used can be set
      by the sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in the data path, caused by the additional address
         translation of the sndbuf by the RNIC in Tx. But in general,
         translating addresses through the MTT is fast.
      
         Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
         latency and bandwidth test with physically and virtually contiguous
         buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in the buffer initialization and destruction path, caused
         by additional MR operations on sndbufs. But thanks to the link group
         buffer reuse mechanism, the impact of this regression decreases as
         buffer reuse increases.
      
         Taking 256KB sndbuf and RMB as an example, the latency of some key
         SMC-R buffer-related functions obtained by bpftrace is as follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. The above tests run on 2 VMs within the same host.
      2. The NIC is ConnectX-4 Lx, using SR-IOV and passing 2 VFs through to
         each VM respectively.
      3. The VMs' vCPUs are bound to different physical CPUs, and the bound
         physical CPUs are isolated by the `isolcpus=xxx` cmdline.
      4. The NICs' queue numbers are set to 1.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove redundant dma sync ops · 6d52e2de
      By Guangguan Wang
      smc_ib_sync_sg_for_cpu/device are the ops used for DMA memory cache
      consistency. SMC sndbufs are DMA buffers to which the CPU writes data
      and from which the PCIe device reads it. So for sndbufs,
      smc_ib_sync_sg_for_device is needed, while smc_ib_sync_sg_for_cpu is
      redundant, as the PCIe device never writes these buffers. SMC RMBs are
      DMA buffers to which the PCIe device writes data and from which the
      CPU reads it. So for RMBs, smc_ib_sync_sg_for_cpu is needed, while
      smc_ib_sync_sg_for_device is redundant, as the CPU never writes them.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 26 May 2022, 1 commit
  9. 25 May 2022, 1 commit
  10. 23 May 2022, 2 commits
    • net/smc: fix listen processing for SMC-Rv2 · 8c3b8dc5
      By liuyacan
      In the process of checking whether RDMAv2 is available, the current
      implementation first sets ini->smcrv2.ib_dev_v2 and then allocates the
      smc buf desc, but the latter may fail. Unfortunately, the caller only
      checks the former. In this case, a NULL pointer dereference occurs in
      smc_clc_send_confirm_accept() when accessing conn->rmb_desc.
      
      This patch does two things:
      1. Use the return code to determine whether V2 is available.
      2. If the return code is NODEV, continue to check whether V1 is
      available.
      
      Fixes: e49300a6 ("net/smc: add listen processing for SMC-Rv2")
      Signed-off-by: liuyacan <liuyacan@corp.netease.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: postpone sk_refcnt increment in connect() · 75c1edf2
      By liuyacan
      Same trigger condition as commit 86434744: a setsockopt() runs in
      parallel to a connect() and switches the socket into fallback mode.
      The sk_refcnt is then incremented in smc_connect(), but the state
      stays in SMC_INIT (not SMC_ACTIVE), so the corresponding sk_refcnt
      decrement in __smc_release() is never performed.
      
      Fixes: 86434744 ("net/smc: add fallback check to connect()")
      Signed-off-by: liuyacan <liuyacan@corp.netease.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 May 2022, 1 commit
  12. 26 Apr 2022, 2 commits
  13. 25 Apr 2022, 1 commit
  14. 15 Apr 2022, 1 commit
  15. 12 Apr 2022, 1 commit
  16. 07 Mar 2022, 1 commit
    • net/smc: fix compile warning for smc_sysctl · 7de8eb0d
      By Dust Li
      kernel test robot reports multiple warnings for smc_sysctl:
      
        In file included from net/smc/smc_sysctl.c:17:
      >> net/smc/smc_sysctl.h:23:5: warning: no previous prototype \
      	for function 'smc_sysctl_init' [-Wmissing-prototypes]
        int smc_sysctl_init(void)
             ^
      and
        >> WARNING: modpost: vmlinux.o(.text+0x12ced2d): Section mismatch \
        in reference from the function smc_sysctl_exit() to the variable
        .init.data:smc_sysctl_ops
        The function smc_sysctl_exit() references
        the variable __initdata smc_sysctl_ops.
        This is often because smc_sysctl_exit lacks a __initdata
        annotation or the annotation of smc_sysctl_ops is wrong.
      
      and
        net/smc/smc_sysctl.c: In function 'smc_sysctl_init_net':
        net/smc/smc_sysctl.c:47:17: error: 'struct netns_smc' has no member named 'smc_hdr'
           47 |         net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
      
      Since we don't need global sysctl initialization, remove the global
      pernet_operations and smc_sysctl_{init|exit} to keep things clean and
      simple, and call smc_sysctl_net_{init|exit} directly from
      smc_net_{init|exit}.
      
      Also initialize sysctl_autocorking_size when CONFIG_SYSCTL is not set,
      which makes sure SMC autocorking stays enabled by default in that
      configuration.
      
      Fixes: 462791bb ("net/smc: add sysctl interface for SMC")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 01 Mar 2022, 3 commits
  18. 28 Feb 2022, 1 commit
  19. 25 Feb 2022, 1 commit
    • net/smc: fix connection leak · 9f1c50cf
      By D. Wythe
      There's a potential leak issue under the following execution sequence:
      
      smc_release  				smc_connect_work
      if (sk->sk_state == SMC_INIT)
					send_clc_confirm
      	tcp_abort();
      					...
      					sk.sk_state = SMC_ACTIVE
      smc_close_active
      switch(sk->sk_state) {
      ...
      case SMC_ACTIVE:
      	smc_close_final()
      	// then wait peer closed
      
      Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are
      still in the tcp send buffer, in which case our connection token cannot
      be delivered to the server side, which means that we cannot get a
      passive close message at all. Therefore, it is impossible for the
      connection to be fully disconnected.
      
      This patch takes a very simple approach to avoid this issue: once the
      state has changed to SMC_ACTIVE after tcp_abort(), we actively abort
      the smc connection. Considering that the state was SMC_INIT before
      tcp_abort(), abandoning the complete disconnection process should not
      cause much of a problem.
      
      In fact, this problem may exist as long as the CLC CONFIRM message is
      not received by the server. Whether a timer should be added after
      smc_close_final() needs to be discussed in the future. But even so,
      this patch provides a faster release for the connection in the above
      case, so it should still be valuable.
      
      Fixes: 39f41f36 ("net/smc: common release code for non-accepted sockets")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Feb 2022, 1 commit
  21. 17 Feb 2022, 1 commit
  22. 11 Feb 2022, 6 commits
    • net/smc: Add global configure for handshake limitation by netlink · f9496b7c
      By D. Wythe
      Although we can control the SMC handshake limitation through socket
      options, applications that need it must modify their code, which is
      quite troublesome for many existing applications. This patch allows
      the global default value of the SMC handshake limitation to be
      modified through netlink, providing a way to put a constraint on
      handshakes without modifying any application code.
      Suggested-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Dynamic control handshake limitation by socket options · a6a6fe27
      By D. Wythe
      This patch adds dynamic control of the SMC handshake limitation for
      every smc socket. In production environments, the same application may
      handle different service types and may have different opinions on the
      SMC handshake limitation.
      
      This patch uses socket options to accomplish that. Since we don't have
      a socket option level for SMC yet, it is implemented at the same time.
      
      This patch does the following:
      
      - add new socket option level: SOL_SMC.
      - add new SMC socket option: SMC_LIMIT_HS.
      - provide getter/setter for SMC socket options.
      
      Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit SMC visits when handshake workqueue congested · 48b6190a
      By D. Wythe
      This patch provides a mechanism to constrain incoming SMC connections
      according to the pressure on the SMC handshake process. At present, a
      high connection rate causes incoming connections to be backlogged in
      the SMC handshake queue, raising connection establishment time, which
      is quite unacceptable for applications based on short-lived
      connections.
      
      There are two ways to implement this mechanism:
      
      1. Put the limitation after TCP is established.
      2. Put the limitation before TCP is established.
      
      In the first way, we need to wait for and receive the CLC messages the
      client will potentially send, and then actively reply with a decline
      message. In a sense this is also a sort of SMC handshake, and it
      affects connection establishment time on its way.
      
      In the second way, the only problem is that we need to inject SMC
      logic into TCP when it is about to reply to the incoming SYN. Since we
      already do that, it no longer seems to be a problem, and the advantage
      is obvious: few additional steps are required to enforce the
      constraint.
      
      This patch uses the second way. After this patch, connections beyond
      the constraint will not receive any SMC indication, and SMC will not
      be involved in any of their subsequent processing.
      
      Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit backlog connections · 8270d9c2
      By D. Wythe
      The current implementation does not handle backlog semantics; one
      potential risk is that the server can be flooded by an unbounded
      number of connections, even from SMC-incapable clients.
      
      This patch puts a limit on backlog connections. Referring to the TCP
      implementation, we divide SMC connections into two categories:
      
      1. Half SMC connections: TCP is established but SMC is not yet
      established.
      
      2. Full SMC connections: SMC is established.
      
      For half SMC connections, since every half SMC connection starts with
      TCP establishment, we can achieve our goal by applying a limit before
      TCP is established. Following the TCP implementation, this limit is
      based not only on the half SMC connections but also on the full
      connections, so it also constrains full SMC connections.
      
      For full SMC connections, although we know exactly where they start,
      it's quite hard to put a limit before that point. The easiest way is
      to block and wait before receiving the SMC confirm CLC message, but
      that happens under the protection of smc_server_lgr_pending, a global
      lock, which would make the limit apply to the entire host instead of a
      single listen socket. Another way is to drop the full connections, but
      considering the cost of establishing SMC connections, we prefer to
      keep them.
      
      Even so, limits on full SMC connections still exist; see the commits
      about half SMC connections above.
      
      After this patch, the limits on backlog connections look like:
      
      For SMC:
      
      1. A client with SMC capability can make at most 2 * backlog full SMC
         connections, or 1 * backlog half SMC connections plus 1 * backlog
         full SMC connections.
      
      2. A client without SMC capability can only make 1 * backlog half TCP
         connections plus 1 * backlog full TCP connections.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Make smc_tcp_listen_work() independent · 3079e342
      By D. Wythe
      In a multithreaded, 10K-connection benchmark, the backend TCP
      connections are established very slowly, and lots of TCP connections
      stay in SYN_SENT state.
      
      Client: smc_run wrk -c 10000 -t 4 http://server
      
      netstat on the server host shows:
          145042 times the listen queue of a socket overflowed
          145042 SYNs to LISTEN sockets dropped
      
      One reason for this issue is that smc_tcp_listen_work() shares the
      same workqueue (smc_hs_wq) with smc_listen_work(), while
      smc_listen_work() blocks waiting for the smc connection to be
      established. Once the workqueue becomes congested, it blocks accept()
      on the TCP listen socket.
      
      This patch creates an independent workqueue (smc_tcp_ls_wq) for
      smc_tcp_listen_work(), separating it from smc_listen_work(), which is
      quite acceptable considering that smc_tcp_listen_work() runs very
      fast.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Avoid overwriting the copies of clcsock callback functions · 1de9770d
      By Wen Gu
      The callback functions of the clcsock are saved and replaced during
      the fallback. But if the fallback happens more than once, the copies
      of these callback functions are overwritten incorrectly, resulting in
      a call loop:
      
      clcsk->sk_error_report
       |- smc_fback_error_report() <------------------------------|
           |- smc_fback_forward_wakeup()                          | (loop)
               |- clcsock_callback()  (incorrectly overwritten)   |
                   |- smc->clcsk_error_report() ------------------|
      
      So this patch fixes the issue by saving these function pointers only
      once during the fallback and avoiding overwriting them.
      
      Reported-by: syzbot+4de3c0e8a263e1e499bc@syzkaller.appspotmail.com
      Fixes: 341adeec ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
      Link: https://lore.kernel.org/r/0000000000006d045e05d78776f6@google.com
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 31 Jan 2022, 3 commits
    • net/smc: Cork when sendpage with MSG_SENDPAGE_NOTLAST flag · be9a16cc
      By Tony Lu
      This introduces handling of the corked flag MSG_SENDPAGE_NOTLAST,
      which is involved in the sendfile() syscall [1] and indicates that
      this is not the last page. So we can cork the data until a page no
      longer carries this flag. It has the same effect as MSG_MORE, but
      exists in sendfile() only.
      
      This patch handles MSG_SENDPAGE_NOTLAST for corking data, trying to
      cork more data before sending when sendfile() is used, which matches
      TCP's behaviour. Also, it reimplements the default sendpage to
      indicate that it is supported to some extent.
      
      [1] https://man7.org/linux/man-pages/man2/sendfile.2.html
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Send directly when TCP_CORK is cleared · ea785a1a
      By Tony Lu
      According to the man page of TCP_CORK [1], if set, don't send out
      partial frames. All queued partial frames are sent when the option is
      cleared again.
      
      When an application calls setsockopt() to disable TCP_CORK, the call
      is protected by lock_sock() and tries to mod_delayed_work() to 0 in
      order to send pending data right away. However, the delayed work
      smc_tx_work is also protected by lock_sock(). This introduces lock
      contention for sending data.
      
      To fix it, send pending data directly, as TCP does, in the context of
      setsockopt() (which already holds lock_sock()), and cancel the now
      unnecessary delayed work, which is protected by the lock.
      
      [1] https://linux.die.net/man/7/tcp
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Forward wakeup to smc socket waitqueue after fallback · 341adeec
      By Wen Gu
      When we replace TCP with SMC and a fallback occurs, there may be
      some socket waitqueue entries remaining in smc socket->wq, such
      as eppoll_entries inserted by userspace applications.
      
      After the fallback, data flows over TCP/IP and only clcsock->wq will
      be woken up. Applications can't be notified by the entries that were
      inserted in smc socket->wq before the fallback. So we need a mechanism
      to wake up smc socket->wq at the same time if entries remain in it.
      
      The current workaround is to transfer the entries from smc socket->wq
      to clcsock->wq during the fallback. But this may cause a crash
      like this:
      
       general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E     5.16.0+ #107
       RIP: 0010:__wake_up_common+0x65/0x170
       Call Trace:
        <IRQ>
        __wake_up_common_lock+0x7a/0xc0
        sock_def_readable+0x3c/0x70
        tcp_data_queue+0x4a7/0xc40
        tcp_rcv_established+0x32f/0x660
        ? sk_filter_trim_cap+0xcb/0x2e0
        tcp_v4_do_rcv+0x10b/0x260
        tcp_v4_rcv+0xd2a/0xde0
        ip_protocol_deliver_rcu+0x3b/0x1d0
        ip_local_deliver_finish+0x54/0x60
        ip_local_deliver+0x6a/0x110
        ? tcp_v4_early_demux+0xa2/0x140
        ? tcp_v4_early_demux+0x10d/0x140
        ip_sublist_rcv_finish+0x49/0x60
        ip_sublist_rcv+0x19d/0x230
        ip_list_rcv+0x13e/0x170
        __netif_receive_skb_list_core+0x1c2/0x240
        netif_receive_skb_list_internal+0x1e6/0x320
        napi_complete_done+0x11d/0x190
        mlx5e_napi_poll+0x163/0x6b0 [mlx5_core]
        __napi_poll+0x3c/0x1b0
        net_rx_action+0x27c/0x300
        __do_softirq+0x114/0x2d2
        irq_exit_rcu+0xb4/0xe0
        common_interrupt+0xba/0xe0
        </IRQ>
        <TASK>
      
      The crash is caused by privately transferring waitqueue entries from
      smc socket->wq to clcsock->wq. The owners of these entries, such as
      epoll, have no idea that the entries have been transferred to a
      different socket wait queue and still use original waitqueue spinlock
      (smc socket->wq.wait.lock) to make the entries operation exclusive,
      but it doesn't work. The operations to the entries, such as removing
      from the waitqueue (now is clcsock->wq after fallback), may cause a
      crash when clcsock waitqueue is being iterated over at the moment.
      
      This patch tries to fix this by no longer transferring wait queue
      entries privately, but introducing own implementations of clcsock's
      callback functions in fallback situation. The callback functions will
      forward the wakeup to smc socket->wq if clcsock->wq is actually woken
      up and smc socket->wq has remaining entries.
      
      Fixes: 2153bd1e ("net/smc: Transfer remaining wait queue entries during fallback")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 24 Jan 2022, 1 commit
    • net/smc: Transitional solution for clcsock race issue · c0bf3d8a
      By Wen Gu
      We encountered a crash in smc_setsockopt() and it is caused by
      accessing smc->clcsock after clcsock was released.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E     5.16.0-rc4+ #53
       RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f16ba83918e
        </TASK>
      
      This patch fixes the issue by holding clcsock_release_lock and
      checking whether clcsock has already been released before accessing
      it.
      
      In case a crash of the same kind happens in smc_getsockopt() or
      smc_switch_to_fallback(), this patch also checks smc->clcsock in
      those functions. And the caller of smc_switch_to_fallback() will
      identify whether fallback succeeded according to the return value.
      
      Fixes: fd57770d ("net/smc: wait for pending work before clcsock release_sock")
      Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 13 Jan 2022, 1 commit
  26. 06 Jan 2022, 1 commit
      net/smc: Reset conn->lgr when link group registration fails · 36595d8a
      Wen Gu authored
      SMC connections might fail to be registered in a link group because
      no usable link can be found during its creation. As a result,
      smc_conn_create() will return a failure and most resources related
      to the connection won't be applied or initialized, such as
      conn->abort_work or conn->lnk.
      
      If smc_conn_free() is invoked later, it will try to access the
      uninitialized resources related to the connection, thus causing
      a warning or crash.
      
      This patch tries to fix this by resetting conn->lgr to NULL if an
      abnormal exit occurs in smc_lgr_register_conn(), thus avoiding
      access to uninitialized resources in smc_conn_free().
      
      Meanwhile, the newly created link group should be terminated if smc
      connections can't be registered in it. So smc_lgr_cleanup_early() is
      modified to take care of the link group only and is invoked by
      smc_conn_create() to terminate an unusable link group. The call to
      smc_conn_free() is moved out of smc_lgr_cleanup_early() into
      smc_conn_abort().
      
      Fixes: 56bc3b20 ("net/smc: assign link to a new connection")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 02 Jan 2022, 1 commit
      net/smc: Introduce TCP ULP support · d7cd421d
      Tony Lu authored
      This implements a TCP ULP for SMC, which helps applications replace
      TCP with the SMC protocol in place, and we use it to implement
      transparent replacement.
      
      It replaces the original TCP socket with SMC, reusing TCP as the
      clcsock, when setsockopt() is called with the TCP_ULP option, without
      any overhead.
      
      To replace TCP sockets with SMC, there are two approaches:
      
      - use the setsockopt() syscall with the TCP_ULP option; on error, it
        falls back to TCP.
      
      - use a BPF prog of type BPF_CGROUP_INET_SOCK_CREATE or similar to
        replace sockets transparently. BPF hooks some points in socket
        creation, bind and others; users can inject their BPF logic without
        modifying their applications, and choose which connections should
        be replaced with SMC by calling setsockopt() in the BPF prog, based
        on rules such as TCP tuples, PID, cgroup, etc.
      
        BPF doesn't support calling setsockopt() with TCP_ULP yet; I will
        send the patches after this is accepted.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 17 Dec 2021, 1 commit
      net/smc: Prevent smc_release() from long blocking · 5c15b312
      D. Wythe authored
      In the nginx/wrk benchmark, there is a hang problem with high
      probability in a case like the following (the client takes several
      minutes to exit):
      
      server: smc_run nginx
      
      client: smc_run wrk -c 10000 -t 1 http://server
      
      Client hangs with the following backtrace:
      
      0 [ffffa7ce80f3bbf8] __schedule at ffffffff9f9e0d5f
      1 [ffffa7ce80f3bc88] schedule at ffffffff9f9e10e6
      2 [ffffa7ce80f3bca0] schedule_timeout at ffffffff9f9e3f3c
      3 [ffffa7ce80f3bd20] wait_for_common at ffffffff9f9e19de
      4 [ffffa7ce80f3bd80] __flush_work at ffffffff9f0fe013
      5 [ffffa7ce80f3bdf0] smc_release at ffffffffc0697d24 [smc]
      6 [ffffa7ce80f3be20] __sock_release at ffffffff9f802e2d
      7 [ffffa7ce80f3be40] sock_close at ffffffff9f802eb1
      8 [ffffa7ce80f3be48] __fput at ffffffff9f334f93
      9 [ffffa7ce80f3be78] task_work_run at ffffffff9f101ff5
      10 [ffffa7ce80f3bea0] do_exit at ffffffff9f0e5012
      11 [ffffa7ce80f3bf10] do_group_exit at ffffffff9f0e592a
      12 [ffffa7ce80f3bf38] __x64_sys_exit_group at ffffffff9f0e5994
      13 [ffffa7ce80f3bf40] do_syscall_64 at ffffffff9f9d4373
      14 [ffffa7ce80f3bf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c
      
      This issue is due to flush_work(), which is used in smc_release() to
      wait for smc_connect_work() to finish. Once lots of smc_connect_work()
      items are pending, or all executing work is dangling, smc_release()
      has to block until a worker becomes free, which is equivalent to
      waiting for another smc_connect_work() to finish.
      
      In order to fix this, there are two changes:
      
      1. For idle smc_connect_work(), cancel it from the workqueue; for
         executing smc_connect_work(), wait for it to finish. For that
         purpose, replace flush_work() with cancel_work_sync().
      
      2. Since smc_connect() holds a reference for passive closing, release
         the reference if smc_connect_work() has been cancelled.
      
      Fixes: 24ac3a08 ("net/smc: rebuild nonblocking connect")
      Reported-by: Tony Lu <tonylu@linux.alibaba.com>
      Tested-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>