1. 18 July 2022 (12 commits)
    • tls: rx: read the input skb from ctx->recv_pkt · 541cc48b
      Committed by Jakub Kicinski
      Callers always pass ctx->recv_pkt into decrypt_skb_update(),
      and it propagates it to its callees. This may give someone
      the false impression that those functions can accept any valid
      skb containing a TLS record. That's not the case: the record
      sequence number is read from the context, so they can only
      take the next record coming out of the strparser.
      
      Let the functions get the skb from the context instead of
      passing it in. This will also make it cleaner to return
      a different skb than ctx->recv_pkt as the decrypted one
      later on.
      
      Since we're touching the definition of decrypt_skb_update(),
      use this as an opportunity to rename it.
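      
      Below is a minimal sketch of the idea, not the actual net/tls code:
      the decrypt helper takes only the context and pulls the skb from
      ctx->recv_pkt itself; apart from ctx->recv_pkt, the function and
      callee names are hypothetical.
      
         /* Hedged sketch: only ctx->recv_pkt is a real field name. */
         static int tls_rx_decrypt_next(struct sock *sk,
                                        struct tls_sw_context_rx *ctx,
                                        struct iov_iter *dest, bool *zc)
         {
                 /* Only the next record produced by the strparser is valid,
                  * because the record sequence number lives in the context. */
                 struct sk_buff *skb = ctx->recv_pkt;
 
                 if (!skb)
                         return -EAGAIN;
                 return tls_decrypt_one_record(sk, ctx, skb, dest, zc);
         }
      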
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: factor out device darg update · 8a958732
      Committed by Jakub Kicinski
      I already forgot to transform darg from input to output
      semantics once on the NIC inline crypto fastpath. To
      avoid this happening again, create a device equivalent
      of decrypt_internal(): a function responsible for decryption
      and for transforming darg.
      
      While at it rename decrypt_internal() to a hopefully slightly
      more meaningful name.
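      
      A hedged sketch of what such a device-side helper could look like;
      the helper it calls and the darg struct/field names are assumptions,
      not the actual net/tls implementation.
      
         /* Hedged sketch: device counterpart of the sw decrypt helper,
          * also responsible for switching darg to output semantics. */
         static int tls_decrypt_device(struct sock *sk,
                                       struct tls_context *tls_ctx,
                                       struct tls_decrypt_arg *darg)
         {
                 int ret = tls_device_check_decrypted(sk, tls_ctx); /* hypothetical */
 
                 if (ret <= 0)
                         return ret;      /* not (fully) decrypted by the NIC */
                 darg->zc = false;        /* output semantics: data is cleartext */
                 darg->async = false;
                 return 1;
         }
      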
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: remove the message decrypted tracking · 53d57999
      Committed by Jakub Kicinski
      We no longer allow a decrypted skb to remain linked to ctx->recv_pkt.
      Anything on the list is decrypted; anything on ctx->recv_pkt still
      needs to be decrypted.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: don't keep decrypted skbs on ctx->recv_pkt · abb47dc9
      Committed by Jakub Kicinski
      Detach the skb from ctx->recv_pkt after decryption is done,
      even if we can't consume it.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: don't try to keep the skbs always on the list · 008141de
      Committed by Jakub Kicinski
      I thought that having the skb always on either ctx->rx_list
      or ctx->recv_pkt would simplify the handling, as we would not
      have to remember to flip it from one to the other on exit paths.
      
      This became a little harder to justify with the fix for BPF
      sockmaps. Subsequent changes will make the situation even worse.
      Queue the skbs only when really needed.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: allow only one reader at a time · 4cbc325e
      Committed by Jakub Kicinski
      recvmsg() in TLS gets data from the skb list (rx_list) or fresh
      skbs we read from TCP via strparser. The former holds skbs which were
      already decrypted for peek or decrypted and partially consumed.
      
      tls_wait_data() only notices the appearance of fresh skbs coming out
      of TCP (or psock). If there are concurrent calls to peek() and
      recv(), it is possible that peek() will move the data from the input
      to rx_list without recv() noticing. recv() will then read data out
      of order or never wake up.
      
      This is not a practical use case/concern, but it makes the
      selftests less reliable. This patch solves the problem by allowing
      only one reader in.
      
      Because having multiple processes calling read()/peek() is not
      normal, avoid adding a lock and try to fast-path the single-reader
      case.
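      
      A minimal kernel-style sketch of that single-reader fast path; the
      reader_present/reader_wq fields and helper names are hypothetical,
      not the actual tls_sw implementation, and the flag is assumed to be
      protected by the socket lock.
      
         static int tls_rx_reader_acquire(struct sock *sk,
                                          struct tls_sw_context_rx *ctx)
         {
                 lock_sock(sk);
                 while (ctx->reader_present) {
                         DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
                         /* slow path: a second reader showed up, wait */
                         add_wait_queue(&ctx->reader_wq, &wait);
                         release_sock(sk);
                         wait_woken(&wait, TASK_INTERRUPTIBLE,
                                    MAX_SCHEDULE_TIMEOUT);
                         remove_wait_queue(&ctx->reader_wq, &wait);
                         lock_sock(sk);
                 }
                 ctx->reader_present = true;  /* fast path: just mark ourselves */
                 return 0;
         }
 
         static void tls_rx_reader_release(struct sock *sk,
                                           struct tls_sw_context_rx *ctx)
         {
                 ctx->reader_present = false;
                 wake_up(&ctx->reader_wq);
                 release_sock(sk);
         }
      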
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Extend SMC-R link group netlink attribute · ddefb2d2
      Committed by Wen Gu
      Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
      Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
      the SMC-R link group.
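      
      A hedged sketch of how such an attribute might be emitted when the
      SMC_GEN_LGR_SMCR information is filled; the helper name and the
      lgr->buf_type field access are assumptions.
      
         /* Hedged sketch: put the buffer type as a u8 netlink attribute. */
         static int smc_nl_put_lgr_buf_type(struct sk_buff *skb,
                                            struct smc_link_group *lgr)
         {
                 if (nla_put_u8(skb, SMC_NLA_LGR_R_BUF_TYPE, lgr->buf_type))
                         return -EMSGSIZE;
                 return 0;
         }
      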
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      Committed by Wen Gu
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which can cause unexpected hangs
      and further stability risks.
      
      So this patch allows SMC-R link groups to use virtually contiguous
      sndbufs and RMBs to avoid the potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be
      selected via the sysctl smcr_buf_type.
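      
      A hedged sketch of the allocation choice this implies; the helper
      name and flags are assumptions, and the real smc_buf_desc setup does
      more work (mapping, registration, etc.).
      
         /* Hedged sketch: pick the backing storage for an SMC-R buffer. */
         static void *smcr_alloc_buf(size_t len, int buf_type, bool *is_vmalloc)
         {
                 void *buf = NULL;
 
                 if (buf_type != SMCR_VIRT_CONT_BUFS)
                         buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN,
                                                        get_order(len));
                 if (!buf && buf_type != SMCR_PHYS_CONT_BUFS)
                         buf = vzalloc(len);   /* virtually contiguous fallback */
                 *is_vmalloc = buf && is_vmalloc_addr(buf);
                 return buf;
         }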
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in the data path, caused by the additional address
         translation of the sndbuf by the RNIC in Tx. But in general,
         translating addresses through the MTT is fast.
      
         Taking a 256KB sndbuf and RMB as an example, the comparisons of
         qperf latency and bandwidth tests with physically and virtually
         contiguous buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in the buffer initialization and destruction path,
         caused by the additional MR operations on sndbufs. But thanks to
         the link group buffer reuse mechanism, the impact of this
         regression decreases as buffers are reused more often.
      
         Taking a 256KB sndbuf and RMB as an example, the latencies of some
         key SMC-R buffer-related functions, obtained with bpftrace, are as
         follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. The above tests run on 2 VMs within the same host.
      2. The NIC is a ConnectX-4Lx, using SR-IOV and passing through 2 VFs,
         one to each VM.
      3. The VMs' vCPUs are bound to different physical CPUs, and those
         physical CPUs are isolated via the `isolcpus=xxx` cmdline.
      4. The NICs' queue number is set to 1.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Use sysctl-specified types of buffers in new link group · b984f370
      Committed by Wen Gu
      This patch introduces a new SMC-R-specific field, buf_type, in
      struct smc_link_group, to record the value of the sysctl
      smcr_buf_type when the link group is created.
      
      Newly created link groups will create and reuse buffers of the
      type specified by buf_type.
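      
      A hedged sketch of the idea; the per-netns sysctl field name and the
      helper are assumptions.
      
         /* Hedged sketch: snapshot the sysctl when the link group is created. */
         static void smcr_lgr_set_buf_type(struct smc_link_group *lgr,
                                           struct net *net)
         {
                 lgr->buf_type = net->smc.sysctl_smcr_buf_type;
         }
      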
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Introduce a sysctl for setting SMC-R buffer type · 4bc5008e
      Committed by Wen Gu
      This patch introduces the sysctl smcr_buf_type for setting
      the type of SMC-R sndbufs and RMBs.
      
      Valid values include:
      
      - SMCR_PHYS_CONT_BUFS, which means use physically contiguous
        buffers for better performance and is the default value.
      
      - SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
        buffers in case physically contiguous memory is scarce.
      
      - SMCR_MIXED_BUFS, which means first try to use physically
        contiguous buffers. If not available, then use virtually
        contiguous buffers.
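      
      A hedged sketch of what the sysctl table entry could look like; the
      per-netns wiring, the field name, the value ordering (0..2) and the
      handler are assumptions.
      
         /* Hedged sketch: expose smcr_buf_type as an integer sysctl. */
         static struct ctl_table smc_table[] = {
                 {
                         .procname     = "smcr_buf_type",
                         .data         = &init_net.smc.sysctl_smcr_buf_type,
                         .maxlen       = sizeof(int),
                         .mode         = 0644,
                         .proc_handler = proc_dointvec_minmax,
                         .extra1       = SYSCTL_ZERO,
                         .extra2       = SYSCTL_TWO,
                 },
                 {}
         };
      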
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu · 0ef69e78
      Committed by Guangguan Wang
      Some CPUs, such as Xeon, can guarantee DMA cache coherency, so there
      is no need to use DMA sync APIs to flush the cache on such CPUs.
      To avoid calling DMA sync APIs on the I/O path, use dma_need_sync()
      to check whether the smc_buf_desc needs DMA sync when creating the
      smc_buf_desc.
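      
      A hedged sketch of the idea; the is_dma_need_sync flag, the sgt
      access and the helper name are assumptions rather than the actual
      net/smc code.
      
         /* Hedged sketch: record at creation time whether DMA syncing is
          * needed, so the I/O path can skip dma_sync_* calls entirely. */
         static void smcr_buf_check_need_sync(struct device *dma_dev,
                                              struct smc_buf_desc *buf_desc)
         {
                 struct scatterlist *sg = buf_desc->sgt[0].sgl;
 
                 buf_desc->is_dma_need_sync =
                         dma_need_sync(dma_dev, sg_dma_address(sg));
         }
      
      The sync helpers on the I/O path would then return early when
      is_dma_need_sync is false.
      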
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove redundant dma sync ops · 6d52e2de
      Committed by Guangguan Wang
      smc_ib_sync_sg_for_cpu/device are the ops used for DMA memory cache
      consistency. SMC sndbufs are DMA buffers to which the CPU writes data
      and from which the PCIe device reads data. So for sndbufs,
      smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
      redundant, as the PCIe device will not write the buffers. SMC RMBs
      are DMA buffers to which the PCIe device writes data and from which
      the CPU reads data. So for RMBs, smc_ib_sync_sg_for_cpu is needed and
      smc_ib_sync_sg_for_device is redundant, as the CPU will not write the
      buffers.
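      
      A hedged sketch of the resulting direction-specific syncs; the struct
      fields and the way the DMA device is reached are assumptions.
      
         /* Hedged sketch: the sndbuf is CPU-written and device-read. */
         static void smc_sndbuf_sync_for_device(struct smc_link *lnk,
                                                struct smc_buf_desc *buf)
         {
                 dma_sync_sg_for_device(lnk->smcibdev->ibdev->dma_device,
                                        buf->sgt[lnk->link_idx].sgl,
                                        buf->sgt[lnk->link_idx].nents,
                                        DMA_TO_DEVICE);
         }
 
         /* Hedged sketch: the RMB is device-written and CPU-read. */
         static void smc_rmb_sync_for_cpu(struct smc_link *lnk,
                                          struct smc_buf_desc *buf)
         {
                 dma_sync_sg_for_cpu(lnk->smcibdev->ibdev->dma_device,
                                     buf->sgt[lnk->link_idx].sgl,
                                     buf->sgt[lnk->link_idx].nents,
                                     DMA_FROM_DEVICE);
         }
      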
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 July 2022 (2 commits)
  3. 15 July 2022 (5 commits)
  4. 14 July 2022 (4 commits)
    • ip: fix dflt addr selection for connected nexthop · 747c1430
      Committed by Nicolas Dichtel
      When a nexthop is added without a gw address, the default scope was set
      to 'host'. Thus, when a source address is selected, 127.0.0.1 may be chosen
      but rejected when the route is used.
      
      When using a route without a nexthop id, the scope can be configured in the
      route, thus the problem doesn't exist.
      
      To explain more deeply: when a user creates a nexthop, the scope cannot
      be specified. To create it, the function nh_create_ipv4() calls
      fib_check_nh() with the scope set to 0. fib_check_nh() calls
      fib_check_nh_nongw(), which set the scope to 'host'. Then,
      nh_create_ipv4() calls fib_info_update_nhc_saddr() with the scope set
      to 'host'. The src addr is chosen before the route is inserted.
      
      When a 'standard' route (i.e. without a reference to a nexthop) is added,
      fib_create_info() calls fib_info_update_nhc_saddr() with the scope set by
      the user. iproute2 sets the scope to 'link' by default.
      
      Here is a way to reproduce the problem:
      ip netns add foo
      ip -n foo link set lo up
      ip netns add bar
      ip -n bar link set lo up
      sleep 1
      
      ip -n foo link add name eth0 type dummy
      ip -n foo link set eth0 up
      ip -n foo address add 192.168.0.1/24 dev eth0
      
      ip -n foo link add name veth0 type veth peer name veth1 netns bar
      ip -n foo link set veth0 up
      ip -n bar link set veth1 up
      
      ip -n bar address add 192.168.1.1/32 dev veth1
      ip -n bar route add default dev veth1
      
      ip -n foo nexthop add id 1 dev veth0
      ip -n foo route add 192.168.1.1 nhid 1
      
      Try to get/use the route:
      > $ ip -n foo route get 192.168.1.1
      > RTNETLINK answers: Invalid argument
      > $ ip netns exec foo ping -c1 192.168.1.1
      > ping: connect: Invalid argument
      
      Try without a nexthop group (iproute2 sets the scope to 'link' by default):
      ip -n foo route del 192.168.1.1
      ip -n foo route add 192.168.1.1 dev veth0
      
      Try to get/use the route:
      > $ ip -n foo route get 192.168.1.1
      > 192.168.1.1 dev veth0 src 192.168.0.1 uid 0
      >     cache
      > $ ip netns exec foo ping -c1 192.168.1.1
      > PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
      > 64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.039 ms
      >
      > --- 192.168.1.1 ping statistics ---
      > 1 packets transmitted, 1 received, 0% packet loss, time 0ms
      > rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
      
      CC: stable@vger.kernel.org
      Fixes: 597cfe4f ("nexthop: Add support for IPv4 nexthops")
      Reported-by: Edwin Brossette <edwin.brossette@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20220713114853.29406-1-nicolas.dichtel@6wind.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • seg6: bpf: fix skb checksum in bpf_push_seg6_encap() · 4889fbd9
      Committed by Andrea Mayer
      Both helper functions bpf_lwt_seg6_action() and bpf_lwt_push_encap() use
      bpf_push_seg6_encap() to encapsulate the packet in an outer IPv6 header
      with a Segment Routing Header (SRH), or to insert an SRH between the
      IPv6 header and the payload.
      To achieve this, bpf_push_seg6_encap() in turn leverages
      seg6_do_srh_{encap,inline}() to perform the required operation
      (i.e. encap/inline).
      
      This patch removes the initialization of the IPv6 header payload length
      from bpf_push_seg6_encap(), as it is now handled properly by
      seg6_do_srh_{encap,inline}() to prevent corruption of the skb checksum.
      
      Fixes: fe94cc29 ("bpf: Add IPv6 Segment Routing helpers")
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors · f048880f
      Committed by Andrea Mayer
      The SRv6 End.B6 and End.B6.Encaps behaviors rely on functions
      seg6_do_srh_{encap,inline}() to, respectively: i) encapsulate the
      packet within an outer IPv6 header with the specified Segment Routing
      Header (SRH); ii) insert the specified SRH directly after the IPv6
      header of the packet.
      
      This patch removes the initialization of the IPv6 header payload length
      from the input_action_end_b6{_encap}() functions, as it is now handled
      properly by seg6_do_srh_{encap,inline}() to avoid corruption of the skb
      checksum.
      
      Fixes: 140f04c3 ("ipv6: sr: implement several seg6local actions")
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • seg6: fix skb checksum evaluation in SRH encapsulation/insertion · df8386d1
      Committed by Andrea Mayer
      Support for SRH encapsulation and insertion was introduced with
      commit 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and
      injection with lwtunnels"), through the seg6_do_srh_encap() and
      seg6_do_srh_inline() functions, respectively.
      The former encapsulates the packet in an outer IPv6 header along with
      the SRH, while the latter inserts the SRH between the IPv6 header and
      the payload. Then, the headers are initialized/updated according to the
      operating mode (i.e., encap/inline).
      Finally, the skb checksum is calculated to reflect the changes applied
      to the headers.
      
      The IPv6 payload length ('payload_len') is not initialized
      within seg6_do_srh_{inline,encap}() but is deferred to seg6_do_srh(), i.e.
      the caller of seg6_do_srh_{inline,encap}().
      However, this invalidates the skb checksum, since
      'payload_len' is updated only after the checksum is evaluated.
      
      To solve this issue, the initialization of the IPv6 payload length is
      moved from seg6_do_srh() directly into the seg6_do_srh_{inline,encap}()
      functions and before the skb checksum update takes place.
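      
      A hedged sketch of the ordering the fix establishes inside
      seg6_do_srh_{inline,encap}(); the tot_len variable and the
      surrounding code are assumptions.
      
         /* Hedged sketch of the encap case only, surrounding code elided. */
         struct ipv6hdr *hdr = ipv6_hdr(skb);
 
         /* initialize payload_len first, so the checksum computed over the
          * newly pushed headers already covers the final value */
         hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
         skb_postpush_rcsum(skb, hdr, tot_len);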
      
      Fixes: 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/all/20220705190727.69d532417be7438b15404ee1@uniroma2.it
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  5. 13 July 2022 (17 commits)