1. 18 7月, 2022 11 次提交
    • J
      tls: rx: remove the message decrypted tracking · 53d57999
      Jakub Kicinski 提交于
      We no longer allow a decrypted skb to remain linked to ctx->recv_pkt.
      Anything on the list is decrypted, anything on ctx->recv_pkt needs
      to be decrypted.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53d57999
    • J
      tls: rx: don't keep decrypted skbs on ctx->recv_pkt · abb47dc9
      Jakub Kicinski 提交于
      Detach the skb from ctx->recv_pkt after decryption is done,
      even if we can't consume it.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      abb47dc9
    • J
      tls: rx: don't try to keep the skbs always on the list · 008141de
      Jakub Kicinski 提交于
      I thought that having the skb either always on the ctx->rx_list
      or ctx->recv_pkt will simplify the handling, as we would not
      have to remember to flip it from one to the other on exit paths.
      
      This became a little harder to justify with the fix for BPF
      sockmaps. Subsequent changes will make the situation even worse.
      Queue the skbs only when really needed.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      008141de
    • J
      tls: rx: allow only one reader at a time · 4cbc325e
      Jakub Kicinski 提交于
      recvmsg() in TLS gets data from the skb list (rx_list) or fresh
      skbs we read from TCP via strparser. The former holds skbs which were
      already decrypted for peek or decrypted and partially consumed.
      
      tls_wait_data() only notices appearance of fresh skbs coming out
      of TCP (or psock). It is possible, if there is a concurrent call
      to peek() and recv() that the peek() will move the data from input
      to rx_list without recv() noticing. recv() will then read data out
      of order or never wake up.
      
      This is not a practical use case/concern, but it makes the self
      tests less reliable. This patch solves the problem by allowing
      only one reader in.
      
      Because having multiple processes calling read()/peek() is not
      normal avoid adding a lock and try to fast-path the single reader
      case.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cbc325e
    • D
      Merge branch 'net-smc-virt-contig-buffers' · 3898f52c
      David S. Miller 提交于
      Wen Gu says:
      
      ====================
      net/smc: Introduce virtually contiguous buffers for SMC-R
      
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which will cause unexpected hung issue
      and further stability risks.
      
      So this patch set is aimed to allow SMC-R link group to use virtually
      contiguous sndbufs and RMBs to avoid potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be set
      by sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in data path, which is brought by additional address
         translation of sndbuf by RNIC in Tx. But in general, translating
         address through MTT is fast. According to qperf test, this part
         regression is basically less than 10% in latency and bandwidth.
         (see patch 5/6 for details)
      
      2) regression in buffer initialization and destruction path, which is
         brought by additional MR operations of sndbufs. But thanks to link
         group buffer reuse mechanism, the impact of this kind of regression
         decreases as times of buffer reuse increases.
      
      Patch set overview:
      - Patch 1/6 and 2/6 mainly about simplifying and optimizing DMA sync
        operation, which will reduce overhead on the data path, especially
        when using virtually contiguous buffers;
      - Patch 3/6 and 4/6 introduce a sysctl smcr_buf_type to set the type
        of buffers in new created link group;
      - Patch 5/6 allows SMC-R to use virtually contiguous sndbufs and RMBs,
        including buffer creation, destruction, MR operation and access;
      - patch 6/6 extends netlink attribute for buffer type of SMC-R link group;
      
      v1->v2:
      - Patch 5/6 fixes build issue on 32bit;
      - Patch 3/6 adds description of new sysctl in smc-sysctl.rst;
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3898f52c
    • W
      net/smc: Extend SMC-R link group netlink attribute · ddefb2d2
      Wen Gu 提交于
      Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
      Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
      SMC-R link group.
      Signed-off-by: NWen Gu <guwen@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ddefb2d2
    • W
      net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      Wen Gu 提交于
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which will cause unexpected hung issue
      and further stability risks.
      
      So this patch is aimed to allow SMC-R link group to use virtually
      contiguous sndbufs and RMBs to avoid potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be set
      by sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in data path, which is brought by additional address
         translation of sndbuf by RNIC in Tx. But in general, translating
         address through MTT is fast.
      
         Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
         latency and bandwidth test with physically and virtually contiguous
         buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in buffer initialization and destruction path, which is
         brought by additional MR operations of sndbufs. But thanks to link
         group buffer reuse mechanism, the impact of this kind of regression
         decreases as times of buffer reuse increases.
      
         Taking 256KB sndbuf and RMB as an example, latency of some key SMC-R
         buffer-related function obtained by bpftrace are as follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. Above tests run on 2 VMs within the same Host.
      2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
         the each VM respectively.
      3. VMs' vCPUs are binded to different physical CPUs, and the binded
         physical CPUs are isolated by `isolcpus=xxx` cmdline.
      4. NICs' queue number are set to 1.
      Signed-off-by: NWen Gu <guwen@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8d19945
    • W
      net/smc: Use sysctl-specified types of buffers in new link group · b984f370
      Wen Gu 提交于
      This patch introduces a new SMC-R specific element buf_type
      in struct smc_link_group, for recording the value of sysctl
      smcr_buf_type when link group is created.
      
      New created link group will create and reuse buffers of the
      type specified by buf_type.
      Signed-off-by: NWen Gu <guwen@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b984f370
    • W
      net/smc: Introduce a sysctl for setting SMC-R buffer type · 4bc5008e
      Wen Gu 提交于
      This patch introduces the sysctl smcr_buf_type for setting
      the type of SMC-R sndbufs and RMBs.
      
      Valid values includes:
      
      - SMCR_PHYS_CONT_BUFS, which means use physically contiguous
        buffers for better performance and is the default value.
      
      - SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
        buffers in case of physically contiguous memory is scarce.
      
      - SMCR_MIXED_BUFS, which means first try to use physically
        contiguous buffers. If not available, then use virtually
        contiguous buffers.
      Signed-off-by: NWen Gu <guwen@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bc5008e
    • G
      net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu · 0ef69e78
      Guangguan Wang 提交于
      Some CPU, such as Xeon, can guarantee DMA cache coherency.
      So it is no need to use dma sync APIs to flush cache on such CPUs.
      In order to avoid calling dma sync APIs on the IO path, use the
      dma_need_sync to check whether smc_buf_desc needs dma sync when
      creating smc_buf_desc.
      Signed-off-by: NGuangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ef69e78
    • G
      net/smc: remove redundant dma sync ops · 6d52e2de
      Guangguan Wang 提交于
      smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
      consistency. Smc sndbufs are dma buffers, where CPU writes data to
      it and PCIE device reads data from it. So for sndbufs,
      smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
      redundant as PCIE device will not write the buffers. Smc rmbs
      are dma buffers, where PCIE device write data to it and CPU read
      data from it. So for rmbs, smc_ib_sync_sg_for_cpu is needed and
      smc_ib_sync_sg_for_device is redundant as CPU will not write the buffers.
      Signed-off-by: NGuangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6d52e2de
  2. 16 7月, 2022 4 次提交
  3. 15 7月, 2022 23 次提交
  4. 14 7月, 2022 2 次提交
    • N
      selftests/net: test nexthop without gw · cd72e61b
      Nicolas Dichtel 提交于
      This test implement the scenario described in the commit
      "ip: fix dflt addr selection for connected nexthop".
      The test configures a nexthop object with an output device only (no gateway
      address) and a route that uses this nexthop. The goal is to check if the
      kernel selects a valid source address.
      
      Link: https://lore.kernel.org/netdev/20220712095545.10947-1-nicolas.dichtel@6wind.com/Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20220713114853.29406-2-nicolas.dichtel@6wind.comSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
      cd72e61b
    • N
      ip: fix dflt addr selection for connected nexthop · 747c1430
      Nicolas Dichtel 提交于
      When a nexthop is added, without a gw address, the default scope was set
      to 'host'. Thus, when a source address is selected, 127.0.0.1 may be chosen
      but rejected when the route is used.
      
      When using a route without a nexthop id, the scope can be configured in the
      route, thus the problem doesn't exist.
      
      To explain more deeply: when a user creates a nexthop, it cannot specify
      the scope. To create it, the function nh_create_ipv4() calls fib_check_nh()
      with scope set to 0. fib_check_nh() calls fib_check_nh_nongw() wich was
      setting scope to 'host'. Then, nh_create_ipv4() calls
      fib_info_update_nhc_saddr() with scope set to 'host'. The src addr is
      chosen before the route is inserted.
      
      When a 'standard' route (ie without a reference to a nexthop) is added,
      fib_create_info() calls fib_info_update_nhc_saddr() with the scope set by
      the user. iproute2 set the scope to 'link' by default.
      
      Here is a way to reproduce the problem:
      ip netns add foo
      ip -n foo link set lo up
      ip netns add bar
      ip -n bar link set lo up
      sleep 1
      
      ip -n foo link add name eth0 type dummy
      ip -n foo link set eth0 up
      ip -n foo address add 192.168.0.1/24 dev eth0
      
      ip -n foo link add name veth0 type veth peer name veth1 netns bar
      ip -n foo link set veth0 up
      ip -n bar link set veth1 up
      
      ip -n bar address add 192.168.1.1/32 dev veth1
      ip -n bar route add default dev veth1
      
      ip -n foo nexthop add id 1 dev veth0
      ip -n foo route add 192.168.1.1 nhid 1
      
      Try to get/use the route:
      > $ ip -n foo route get 192.168.1.1
      > RTNETLINK answers: Invalid argument
      > $ ip netns exec foo ping -c1 192.168.1.1
      > ping: connect: Invalid argument
      
      Try without nexthop group (iproute2 sets scope to 'link' by dflt):
      ip -n foo route del 192.168.1.1
      ip -n foo route add 192.168.1.1 dev veth0
      
      Try to get/use the route:
      > $ ip -n foo route get 192.168.1.1
      > 192.168.1.1 dev veth0 src 192.168.0.1 uid 0
      >     cache
      > $ ip netns exec foo ping -c1 192.168.1.1
      > PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
      > 64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.039 ms
      >
      > --- 192.168.1.1 ping statistics ---
      > 1 packets transmitted, 1 received, 0% packet loss, time 0ms
      > rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
      
      CC: stable@vger.kernel.org
      Fixes: 597cfe4f ("nexthop: Add support for IPv4 nexthops")
      Reported-by: NEdwin Brossette <edwin.brossette@6wind.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20220713114853.29406-1-nicolas.dichtel@6wind.comSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
      747c1430