1. 25 Jan, 2023 3 commits
  2. 15 Oct, 2022 1 commit
  3. 22 Sep, 2022 2 commits
    • net/smc: Unbind r/w buffer size from clcsock and make them tunable · 0227f058
      Tony Lu committed
      Currently, SMC uses smc->sk.sk_{rcv|snd}buf to create the send buffer
      and the RMB, and the buffer sizes are taken from tcp_{w|r}mem in
      clcsock.

      The buffer sizes of the TCP socket don't fit SMC well. SMC-R/-D
      generally needs larger buffers than TCP to get higher performance,
      because it uses different underlying devices and paths.
      
      So this patch unbinds buffer size from TCP, and introduces two sysctl
      knobs to tune them independently. Also, these knobs are per net
      namespace and work for containers.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/smc: Stop the CLC flow if no link to map buffers on · e738455b
      Wen Gu committed
      There might be a potential race between SMC-R buffer map and
      link group termination.
      
      smc_smcr_terminate_all()     | smc_connect_rdma()
      --------------------------------------------------------------
                                   | smc_conn_create()
      for links in smcibdev        |
              schedule links down  |
                                   | smc_buf_create()
                                   |  \- smcr_buf_map_usable_links()
                                   |      \- no usable links found,
                                   |         (rmb->mr = NULL)
                                   |
                                   | smc_clc_send_confirm()
                                   |  \- access conn->rmb_desc->mr[]->rkey
                                   |     (panic)
      
      During reboot and IB device module removal, all links will be set
      down and no usable links remain in link groups. In such a situation
      smcr_buf_map_usable_links() should return an error and stop the
      CLC flow from accessing the uninitialized mr.
      
      Fixes: b9247544 ("net/smc: convert static link ID instances to support multiple links")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. 07 Sep, 2022 1 commit
    • net/smc: Fix possible access to freed memory in link clear · e9b1a4f8
      Yacan Liu committed
      After the QP is modified to the Error state, all RX WRs are completed
      with a WC in IB_WC_WR_FLUSH_ERR status. The current implementation does
      not wait for this to finish, but destroys the QP and frees the link group
      directly. So there is a risk of accessing freed memory in tasklet context.
      
      Here is a crash example:
      
       BUG: unable to handle page fault for address: ffffffff8f220860
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD f7300e067 P4D f7300e067 PUD f7300f063 PMD 8c4e45063 PTE 800ffff08c9df060
       Oops: 0002 [#1] SMP PTI
       CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G S         OE     5.10.0-0607+ #23
       Hardware name: Inspur NF5280M4/YZMB-00689-101, BIOS 4.1.20 07/09/2018
       RIP: 0010:native_queued_spin_lock_slowpath+0x176/0x1b0
       Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 c8 02 00 48 03 04 f5 00 09 98 8e <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
       RSP: 0018:ffffb3b6c001ebd8 EFLAGS: 00010086
       RAX: ffffffff8f220860 RBX: 0000000000000246 RCX: 0000000000080000
       RDX: ffff91db1f86c800 RSI: 000000000000173c RDI: ffff91db62bace00
       RBP: ffff91db62bacc00 R08: 0000000000000000 R09: c00000010000028b
       R10: 0000000000055198 R11: ffffb3b6c001ea58 R12: ffff91db80e05010
       R13: 000000000000000a R14: 0000000000000006 R15: 0000000000000040
       FS:  0000000000000000(0000) GS:ffff91db1f840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffff8f220860 CR3: 00000001f9580004 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <IRQ>
        _raw_spin_lock_irqsave+0x30/0x40
        mlx5_ib_poll_cq+0x4c/0xc50 [mlx5_ib]
        smc_wr_rx_tasklet_fn+0x56/0xa0 [smc]
        tasklet_action_common.isra.21+0x66/0x100
        __do_softirq+0xd5/0x29c
        asm_call_irq_on_stack+0x12/0x20
        </IRQ>
        do_softirq_own_stack+0x37/0x40
        irq_exit_rcu+0x9d/0xa0
        sysvec_call_function_single+0x34/0x80
        asm_sysvec_call_function_single+0x12/0x20
      
      Fixes: bd4ad577 ("smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR")
      Signed-off-by: Yacan Liu <liuyacan@corp.netease.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 18 Jul, 2022 5 commits
    • net/smc: Extend SMC-R link group netlink attribute · ddefb2d2
      Wen Gu committed
      Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
      Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
      SMC-R link group.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      Wen Gu committed
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which can cause unexpected hangs and
      further stability risks.

      So this patch allows an SMC-R link group to use virtually contiguous
      sndbufs and RMBs to avoid the potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be set
      by sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in data path, which is brought by additional address
         translation of sndbuf by RNIC in Tx. But in general, translating
         address through MTT is fast.
      
         Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
         latency and bandwidth test with physically and virtually contiguous
         buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in buffer initialization and destruction path, which is
         brought by additional MR operations of sndbufs. But thanks to the link
         group buffer reuse mechanism, the impact of this kind of regression
         decreases as the number of buffer reuses increases.

         Taking 256KB sndbuf and RMB as an example, the latencies of some key
         SMC-R buffer-related functions obtained by bpftrace are as follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. Above tests run on 2 VMs within the same host.
      2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
         each VM respectively.
      3. VMs' vCPUs are bound to different physical CPUs, and the bound
         physical CPUs are isolated by `isolcpus=xxx` cmdline.
      4. NICs' queue number is set to 1.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Use sysctl-specified types of buffers in new link group · b984f370
      Wen Gu committed
      This patch introduces a new SMC-R specific element buf_type
      in struct smc_link_group, for recording the value of sysctl
      smcr_buf_type when link group is created.
      
      A newly created link group will create and reuse buffers of the
      type specified by buf_type.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu · 0ef69e78
      Guangguan Wang committed
      Some CPUs, such as Xeon, can guarantee DMA cache coherency, so there
      is no need to use the dma sync APIs to flush the cache on such CPUs.
      In order to avoid calling dma sync APIs on the IO path, use
      dma_need_sync to check whether smc_buf_desc needs dma sync when
      creating smc_buf_desc.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove redundant dma sync ops · 6d52e2de
      Guangguan Wang committed
      smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
      consistency. SMC sndbufs are dma buffers that the CPU writes data to
      and the PCIe device reads data from. So for sndbufs,
      smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
      redundant, as the PCIe device will not write the buffers. SMC RMBs
      are dma buffers that the PCIe device writes data to and the CPU reads
      data from. So for RMBs, smc_ib_sync_sg_for_cpu is needed and
      smc_ib_sync_sg_for_device is redundant, as the CPU will not write the buffers.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 03 Mar, 2022 2 commits
    • net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server · 4940a1fd
      D. Wythe committed
      The cause of SMC_CLC_DECL_ERR_REGRMB on the server is clear.
      Whether a new SMC connection can be accepted depends not only on
      the limit of the number of connections, but also on the available
      rtoken entries. Since the rtoken release is triggered by the peer, while
      the number of connections is decreased locally, many things can happen
      in this time window.

      The only thing that needs to be mentioned is that all connection
      creations are now completely protected by the smc_server_lgr_pending
      lock, so it is enough to check only the available entries in
      rtokens_used_mask.
      
      Fixes: cd6851f3 ("smc: remote memory buffers (RMBs)")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client · 0537f0a2
      D. Wythe committed
      The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB in the
      client is the following execution sequence:
      
      Server Conn A:           Server Conn B:			Client Conn B:
      
      smc_lgr_unregister_conn
                              smc_lgr_register_conn
                              smc_clc_send_accept     ->
                                                              smc_rtoken_add
      smcr_buf_unuse
      		->		Client Conn A:
      				smc_rtoken_delete
      
      smc_lgr_unregister_conn() makes the current link available to be assigned
      to a new incoming connection while smcr_buf_unuse() has not executed yet,
      which means that smc_rtoken_add may fail because of insufficient
      rtoken_entry. Reversing their execution order avoids this problem.
      
      Fixes: 3e034725 ("net/smc: common functions for RMBs and send buffers")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 01 Mar, 2022 1 commit
    • net/smc: correct settings of RMB window update limit · 6bf536eb
      Dust Li committed
      rmbe_update_limit is used to limit announcing receive
      window updates too frequently. RFC7609 requests a minimal
      increase in the window size of 10% of the receive buffer
      space. But the current implementation used:

        min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)

      and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
      always less than 10% of the receive buffer space.

      This causes the receiver to always send a CDC message to
      update its consumer cursor when it consumes more than 2K
      of data. As a result, we may encounter something like
      "TCP silly window syndrome" when sending 2.5~8K messages.
      
      This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
      
      With this patch and SMC autocorking enabled, qperf 2K/4K/8K
      tcp_bw test shows 45%/75%/40% increase in throughput respectively.
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 Jan, 2022 1 commit
    • net/smc: Fix hung_task when removing SMC-R devices · 56d99e81
      Wen Gu committed
      A hung_task is observed when removing SMC-R devices. Suppose that
      a link group has two active links (lnk_A, lnk_B) associated with two
      different SMC-R devices (dev_A, dev_B). When dev_A is removed, the
      link group will be removed from smc_lgr_list and added into
      lgr_linkdown_list. lnk_A will be cleared and smcibdev(A)->lnk_cnt
      will reach zero. However, when dev_B is then removed, the link
      group can't be found in smc_lgr_list and lnk_B won't be cleared,
      so smcibdev->lnk_cnt never reaches zero, which causes a hung_task.

      This patch fixes this issue by restoring the implementation of
      smc_smcr_terminate_all() to what it was before commit 349d4312
      ("net/smc: fix kernel panic caused by race of smc_sock"). The original
      implementation also satisfies the intention of making sure the QP is
      destroyed earlier than the CQ, because we always wait for
      smcibdev->lnk_cnt to reach zero, which guarantees the QP has been destroyed.
      
      Fixes: 349d4312 ("net/smc: fix kernel panic caused by race of smc_sock")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 13 Jan, 2022 3 commits
    • net/smc: Resolve the race between SMC-R link access and clear · 20c9398d
      Wen Gu committed
      We encountered some crashes caused by the race between SMC-R
      link access and link clear that is triggered by abnormal link
      group termination, such as port error.

      Here is an example of this kind of crash:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
       Call Trace:
        <TASK>
        ? __smc_buf_create+0x75a/0x950 [smc]
        smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
        smc_listen_work+0xf72/0x1230 [smc]
        ? process_one_work+0x25c/0x600
        process_one_work+0x25c/0x600
        worker_thread+0x4f/0x3a0
        ? process_one_work+0x600/0x600
        kthread+0x15d/0x1a0
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                     __smc_lgr_terminate()
      ---------------------------------------------------------------
                                          | smc_lgr_free()
                                          |  |- smcr_link_clear()
                                          |      |- memset(lnk, 0)
      smc_listen_rdma_reg()               |
       |- smcr_lgr_reg_rmbs()             |
           |- smc_llc_flow_initiate()     |
               |- access lnk->lgr (panic) |
      
      These crashes are similarly caused by clearing SMC-R link
      resources while some functions are still accessing them.
      This patch tries to fix the issue by introducing a reference
      count for SMC-R links and ensuring that the sensitive resources
      of links won't be cleared until the reference count reaches zero.

      The operations on the SMC-R link reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]         [put]
      --------------------------------------------------------------------
      links           smcr_link_init()                   smcr_link_clear()
      connections     smc_conn_create()                  smc_conn_free()
      
      In this way, SMC-R links are cleared later than all the smc
      connections above them are freed, thus avoiding unsafe
      references to SMC-R links.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Introduce a new conn->lgr validity check helper · ea89c6c0
      Wen Gu committed
      It is no longer suitable to identify whether a smc connection
      is registered in a link group by checking if conn->lgr
      is NULL, because conn->lgr won't be reset even if the connection
      is unregistered from a link group.

      So this patch introduces a new helper smc_conn_lgr_valid() and
      replaces all the checks of conn->lgr in the original implementation
      with the new helper to judge if conn->lgr is valid to use.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Resolve the race between link group access and termination · 61f434b0
      Wen Gu committed
      We encountered some crashes caused by the race between the access
      and the termination of link groups.
      
      Here are some of the panic stacks we met:
      
      1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002f0
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
       Call Trace:
        <TASK>
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        smc_listen_work+0x783/0x1220 [smc]
        ? finish_task_switch+0xc4/0x2e0
        ? process_one_work+0x1ad/0x3c0
        process_one_work+0x1ad/0x3c0
        worker_thread+0x4c/0x390
        ? rescuer_thread+0x320/0x320
        kthread+0x149/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                abnormal case like port error
      ---------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      smc_clc_wait_msg()              |
       |- access conn->lgr (panic)    |
      
      2) Race between smc_setsockopt() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002e8
       RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        </TASK>
      
      smc_setsockopt()                 abnormal case like port error
      --------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      mod_delayed_work()              |
       |- access conn->lgr (panic)    |
      
      There are some other panic places, and they have a similar cause
      to the one described above: accessing the link group after its
      termination, thus getting a NULL pointer or an invalid resource.

      Currently, there seems to be no synchronization between
      link group access and a sudden termination of it. This patch
      tries to fix this by introducing a reference count for the link group
      and not freeing the link group until the reference count is zero.

      A link group might be referred to by links or smc connections. So
      the operations on the link group reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]       [put]
      -------------------------------------------------------------------
      link group      smc_lgr_create()                 smc_lgr_free()
      connections     smc_conn_create()                smc_conn_free()
      links           smcr_link_init()                 smcr_link_clear()
      
      In this way, we extend the life cycle of the link group and
      ensure it is longer than the life cycle of the connections and links
      above it, so as to avoid invalid access to the link group after its
      termination.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 06 Jan, 2022 1 commit
    • net/smc: Reset conn->lgr when link group registration fails · 36595d8a
      Wen Gu committed
      SMC connections might fail to be registered in a link group because
      no usable link can be found during its creation. As a result,
      smc_conn_create() will return a failure and most resources related
      to the connection won't be allocated or initialized, such as
      conn->abort_work or conn->lnk.
      
      If smc_conn_free() is invoked later, it will try to access the
      uninitialized resources related to the connection, thus causing
      a warning or crash.
      
      This patch tries to fix this by resetting conn->lgr to NULL if an
      abnormal exit occurs in smc_lgr_register_conn(), thus avoiding the
      access to uninitialized resources in smc_conn_free().
      
      Meanwhile, the newly created link group should be terminated if smc
      connections can't be registered in it. So smc_lgr_cleanup_early() is
      modified to take care of the link group only and is invoked by
      smc_conn_create() to terminate an unusable link group. The call to
      smc_conn_free() is moved out from smc_lgr_cleanup_early() to
      smc_conn_abort().
      
      Fixes: 56bc3b20 ("net/smc: assign link to a new connection")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 02 Jan, 2022 3 commits
  12. 28 Dec, 2021 2 commits
    • net/smc: fix kernel panic caused by race of smc_sock · 349d4312
      Dust Li committed
      A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
      but smc_release() has already freed it.
      
      [ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88
      [ 4570.696048] #PF: supervisor write access in kernel mode
      [ 4570.696728] #PF: error_code(0x0002) - not-present page
      [ 4570.697401] PGD 0 P4D 0
      [ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI
      [ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111
      [ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0
      [ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30
      <...>
      [ 4570.711446] Call Trace:
      [ 4570.711746]  <IRQ>
      [ 4570.711992]  smc_cdc_tx_handler+0x41/0xc0
      [ 4570.712470]  smc_wr_tx_tasklet_fn+0x213/0x560
      [ 4570.712981]  ? smc_cdc_tx_dismisser+0x10/0x10
      [ 4570.713489]  tasklet_action_common.isra.17+0x66/0x140
      [ 4570.714083]  __do_softirq+0x123/0x2f4
      [ 4570.714521]  irq_exit_rcu+0xc4/0xf0
      [ 4570.714934]  common_interrupt+0xba/0xe0
      
      Though smc_cdc_tx_handler() checked the existence of smc connection,
      smc_release() may have already dismissed and released the smc socket
      before smc_cdc_tx_handler() further visits it.
      
      smc_cdc_tx_handler()           |smc_release()
      if (!conn)                     |
                                     |
                                     |smc_cdc_tx_dismiss_slots()
                                     |      smc_cdc_tx_dismisser()
                                     |
                                     |sock_put(&smc->sk) <- last sock_put,
                                     |                      smc_sock freed
      bh_lock_sock(&smc->sk) (panic) |
      
      To make sure we won't receive any CDC messages after we free the
      smc_sock, add a refcount on the smc_connection for inflight CDC
      messages (posted to the QP but related CQE not yet received), and
      don't release the smc_connection until all the inflight CDC messages
      have been done, both successful and failed ones.

      Using a refcount on CDC messages brings another problem: when the link
      is going to be destroyed, smcr_link_clear() will reset the QP, which
      then removes all the pending CQEs related to the QP in the CQ. To make
      sure all the CQEs will always come back so the refcount on the
      smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced
      by smc_ib_modify_qp_error().
      Also remove the timeout in smc_wr_tx_wait_no_pending_sends(), since we
      need to wait for all pending WQEs to be done, or we may encounter
      use-after-free when handling CQEs.

      For the IB device removal routine, we need to wait for all the QPs on
      that device to be destroyed before we can destroy the CQs on the device,
      or the refcount on smc_connection won't reach 0 and smc_sock cannot be
      released.
      
      Fixes: 5f08318f ("smc: connection data control (CDC)")
      Reported-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: don't send CDC/LLC message if link not ready · 90cee52f
      Dust Li committed
      We found that smc_llc_send_link_delete_all() sometimes waits
      for the 2s timeout when testing with RDMA link up/down.
      It is possible that while a smc_link is in ACTIVATING state,
      the underlying QP is still in RESET or RTR state, which
      cannot send any messages out.

      smc_llc_send_link_delete_all() uses smc_link_usable() to
      check whether the link is usable. If the QP is still in
      RESET or RTR state but the smc_link is in ACTIVATING, this
      LLC message will always fail without any CQE entering the
      CQ, and we will always wait 2s before timeout.

      Since we cannot send any messages through the QP before
      the QP enters RTS, add a wrapper smc_link_sendable()
      which checks the state of the QP along with the link state.
      And replace smc_link_usable() with smc_link_sendable()
      in all LLC & CDC message sending routine.
      
      Fixes: 5f08318f ("smc: connection data control (CDC)")
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 07 Dec, 2021 1 commit
    • net/smc: Clear memory when release and reuse buffer · 1c552696
      Tony Lu committed
      Currently, buffers are cleared when smc connections are created and
      buffers are reused. This slows down the speed of establishing new
      connections. In most cases, the applications want to establish
      connections as quickly as possible.
      
      This patch moves memset() from connection creation path to release and
      buffer unuse path, this trades off between speed of establishing and
      release.
      
      Test environments:
      - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4
      - socket sndbuf / rcvbuf: 16384 / 131072 bytes
      - w/o first round, 5 rounds, avg, 100 conns batch per round
      - smc_buf_create() use bpftrace kprobe, introduces extra latency
      
      Latency benchmarks for smc_buf_create():
        w/o patch : 19040.0 ns
        w/  patch :  1932.6 ns
        ratio :        10.2% (-89.8%)
      
      Latency benchmarks for socket create and connect:
        w/o patch :   143.3 us
        w/  patch :   102.2 us
        ratio :        71.3% (-28.7%)
      
      The latency of establishing connections is reduced by 28.7%.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1c552696
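The idea of clearing a buffer on release instead of on acquisition can be sketched in plain C as follows. This is a simplified userspace model, not the SMC implementation: `struct demo_buf`, `DEMO_BUF_LEN`, `demo_buf_get()`, and `demo_buf_put()` are hypothetical names, and the sketch assumes the pool is zeroed once when it is first allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_LEN 64 /* hypothetical fixed buffer size */

struct demo_buf {
    int used;                /* non-zero while a connection owns the buffer */
    char data[DEMO_BUF_LEN];
};

/* Connection setup path: hand out a free buffer without clearing it. */
static struct demo_buf *demo_buf_get(struct demo_buf *pool, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].used) {
            pool[i].used = 1;
            return &pool[i];
        }
    }
    return NULL; /* no free buffer in the pool */
}

/* Release path: clear the contents first, then mark the buffer reusable. */
static void demo_buf_put(struct demo_buf *b)
{
    assert(b != NULL);
    memset(b->data, 0, sizeof(b->data));
    b->used = 0;
}
```

Because every buffer is zeroed before it becomes reusable, the acquisition path can skip memset() while a reused buffer still starts out clean.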
  14. 02 Dec, 2021 1 commit
    • D
      net/smc: fix wrong list_del in smc_lgr_cleanup_early · 789b6cc2
      Dust Li committed
      smc_lgr_cleanup_early() was meant to delete the link
      group from the link group list, but it deleted
      the list head by mistake.
      
      This may cause memory corruption, since the real link
      group is not removed from the list and the link group
      structure is later cleared with memset().
      We got a list corruption panic when testing:
      
      [  231.277259] list_del corruption. prev->next should be ffff8881398a8000, but was 0000000000000000
      [  231.278222] ------------[ cut here ]------------
      [  231.278726] kernel BUG at lib/list_debug.c:53!
      [  231.279326] invalid opcode: 0000 [#1] SMP NOPTI
      [  231.279803] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.10.46+ #435
      [  231.280466] Hardware name: Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
      [  231.281248] Workqueue: events smc_link_down_work
      [  231.281732] RIP: 0010:__list_del_entry_valid+0x70/0x90
      [  231.282258] Code: 4c 60 82 e8 7d cc 6a 00 0f 0b 48 89 fe 48 c7 c7 88 4c
      60 82 e8 6c cc 6a 00 0f 0b 48 89 fe 48 c7 c7 c0 4c 60 82 e8 5b cc 6a 00 <0f>
      0b 48 89 fe 48 c7 c7 00 4d 60 82 e8 4a cc 6a 00 0f 0b cc cc cc
      [  231.284146] RSP: 0018:ffffc90000033d58 EFLAGS: 00010292
      [  231.284685] RAX: 0000000000000054 RBX: ffff8881398a8000 RCX: 0000000000000000
      [  231.285415] RDX: 0000000000000001 RSI: ffff88813bc18040 RDI: ffff88813bc18040
      [  231.286141] RBP: ffffffff8305ad40 R08: 0000000000000003 R09: 0000000000000001
      [  231.286873] R10: ffffffff82803da0 R11: ffffc90000033b90 R12: 0000000000000001
      [  231.287606] R13: 0000000000000000 R14: ffff8881398a8000 R15: 0000000000000003
      [  231.288337] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      [  231.289160] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  231.289754] CR2: 0000000000e72058 CR3: 000000010fa96006 CR4: 00000000003706f0
      [  231.290485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  231.291211] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  231.291940] Call Trace:
      [  231.292211]  smc_lgr_terminate_sched+0x53/0xa0
      [  231.292677]  smc_switch_conns+0x75/0x6b0
      [  231.293085]  ? update_load_avg+0x1a6/0x590
      [  231.293517]  ? ttwu_do_wakeup+0x17/0x150
      [  231.293907]  ? update_load_avg+0x1a6/0x590
      [  231.294317]  ? newidle_balance+0xca/0x3d0
      [  231.294716]  smcr_link_down+0x50/0x1a0
      [  231.295090]  ? __wake_up_common_lock+0x77/0x90
      [  231.295534]  smc_link_down_work+0x46/0x60
      [  231.295933]  process_one_work+0x18b/0x350
      
      Fixes: a0a62ee1 ("net/smc: separate locks for SMCD and SMCR link group lists")
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      789b6cc2
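The class of bug described in this commit, unlinking the list head instead of the entry, can be shown with a small userspace model of a circular doubly linked list. The helpers below only mimic the shape of the kernel list API; `struct demo_group` and all `demo_*` names are hypothetical and do not reproduce the SMC code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node { struct demo_node *prev, *next; };

/* Make a node point at itself (an empty list or an unlinked entry). */
static void demo_list_init(struct demo_node *n) { n->prev = n; n->next = n; }

/* Insert n right after head. */
static void demo_list_add(struct demo_node *n, struct demo_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n from whatever list it is on and reinitialise it. */
static void demo_list_del_init(struct demo_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    demo_list_init(n);
}

static int demo_list_empty(const struct demo_node *head)
{
    return head->next == head;
}

struct demo_group { int id; struct demo_node list; };

/* Correct removal: unlink the group's own node, never the list head. */
static void demo_group_remove(struct demo_group *g)
{
    assert(g != NULL);
    demo_list_del_init(&g->list);
}
```

Passing the head to `demo_list_del_init()` instead would leave the group's node linked to its neighbours, so clearing or freeing the group afterwards would leave dangling pointers on the list.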
  15. 25 Nov, 2021 1 commit
    • K
      net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk() · 587acad4
      Karsten Graul committed
      Coverity reports a possible NULL dereferencing problem:
      
      in smc_vlan_by_tcpsk():
      6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
      7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
      1623                ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
      CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
      8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
      1624                if (is_vlan_dev(ndev)) {
      
      Remove the manual implementation and use netdev_walk_all_lower_dev() to
      iterate over the lower devices. While at it, remove an obsolete function
      parameter comment.
      
      Fixes: cb9d43f6 ("net/smc: determine vlan_id of stacked net_device")
      Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      587acad4
  16. 15 Nov, 2021 1 commit
  17. 01 Nov, 2021 1 commit
  18. 16 Oct, 2021 4 commits
  19. 09 Oct, 2021 1 commit
  20. 21 Sep, 2021 1 commit
    • K
      net/smc: fix 'workqueue leaked lock' in smc_conn_abort_work · a18cee47
      Karsten Graul committed
      The abort_work is scheduled when a connection is detected to be
      out-of-sync after a link failure. The work calls smc_conn_kill(),
      which calls smc_close_active_abort(), and that might end up calling
      smc_close_cancel_work().
      smc_close_cancel_work() cancels any pending close_work and tx_work, but
      it needs to release the sock_lock beforehand and acquires the sock_lock
      again afterwards. So if the sock_lock was NOT acquired beforehand, it may
      still be held after the abort_work completes. That's why the sock_lock is
      acquired before the call to smc_conn_kill() in __smc_lgr_terminate(),
      but this is missing in smc_conn_abort_work().
      
      Fix that by acquiring the sock_lock first and releasing it after the
      call to smc_conn_kill().
      
      Fixes: b286a065 ("net/smc: handle incoming CDC validation message")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a18cee47
  21. 14 Sep, 2021 1 commit
  22. 09 Aug, 2021 2 commits
  23. 17 Jun, 2021 1 commit