1. 18 Jul 2022, 6 commits
  2. 15 Jul 2022, 1 commit
  3. 12 Jul 2022, 3 commits
  4. 09 Jul 2022, 5 commits
  5. 06 Jul 2022, 5 commits
  6. 02 Jul 2022, 1 commit
  7. 23 Jun 2022, 2 commits
  8. 20 Jun 2022, 1 commit
    • net/tls: fix tls_sk_proto_close executed repeatedly · 69135c57
      By Ziyang Xuan
      After kTLS is set up on a socket, tls_update() updates ctx->sk_proto
      to sock->sk_prot, so ctx->sk_proto->close is now tls_sk_proto_close().
      When the socket is closed, tls_sk_proto_close() is called because
      sock->sk_prot->close is tls_sk_proto_close(). But ctx->sk_proto->close()
      is then invoked again from within tls_sk_proto_close(), so
      tls_sk_proto_close() ends up being executed repeatedly. That triggers
      the following bug.
      
      =================================================================
      KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
      RIP: 0010:tls_sk_proto_close+0xd8/0xaf0 net/tls/tls_main.c:306
      Call Trace:
       <TASK>
       tls_sk_proto_close+0x356/0xaf0 net/tls/tls_main.c:329
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
       __sock_release+0xcd/0x280 net/socket.c:650
       sock_close+0x18/0x20 net/socket.c:1365
      
      Updating a proto that is the same as sock->sk_prot is incorrect. Fix it
      by checking proto against sock->sk_prot for equality at the head of
      tls_update(), as sketched below.
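
      A minimal sketch of the guard (the exact layout of tls_update() in
      net/tls/tls_main.c may differ in details):

          static void tls_update(struct sock *sk, struct proto *p,
                                 void (*write_space)(struct sock *sk))
          {
                  /* p == sk->sk_prot means sk_prot already points at the
                   * TLS callbacks; updating ctx->sk_proto to it would make
                   * tls_sk_proto_close() re-enter itself on close().
                   */
                  if (p == sk->sk_prot)
                          return;
                  /* ... existing update logic ... */
          }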
      
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Reported-by: syzbot+29c3c12f3214b85ad081@syzkaller.appspotmail.com
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 10 Jun 2022, 1 commit
  10. 20 May 2022, 1 commit
  11. 19 May 2022, 1 commit
    • tls: Add opt-in zerocopy mode of sendfile() · c1318b39
      By Boris Pismenny
      TLS device offload copies sendfile data to a bounce buffer before
      transmitting. This keeps the MAC on TLS records valid even if the file
      contents change and part of a TLS record has to be retransmitted at
      the TCP level.
      
      In many common use cases (such as serving static files over HTTPS) the
      file contents are not changed on the fly. In such cases, breaking the
      connection when the file changes during transmission is entirely
      acceptable, because the data would be received corrupted anyway.
      
      This commit optimizes performance for such use cases by providing a
      new, optional mode of TLS sendfile() in which the extra copy is
      skipped. Removing this copy improves performance significantly: TLS
      and TCP sendfile perform the same operations, and the only overhead is
      TLS header/trailer insertion.
      
      The new mode can only be enabled with the new socket option,
      TLS_TX_ZEROCOPY_SENDFILE, on a per-socket basis. It preserves
      backwards compatibility with existing applications that rely on the
      copying behavior; an illustrative sketch follows.
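
      As an illustration, enabling the mode from userspace might look like
      this (a minimal sketch; it assumes a connected TCP socket that already
      has the TLS ULP and TX crypto state configured, and a kernel carrying
      this patch, which introduced the TLS_TX_ZEROCOPY_SENDFILE name):

          #include <sys/socket.h>
          #include <linux/tls.h>

          #ifndef SOL_TLS
          #define SOL_TLS 282     /* value from the kernel's socket.h */
          #endif

          /* fd: TCP socket with kTLS TX already set up */
          static int enable_zc_sendfile(int fd)
          {
                  int one = 1;

                  /* opt in: skip the bounce-buffer copy on sendfile() */
                  return setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_SENDFILE,
                                    &one, sizeof(one));
          }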
      
      The new mode is safe, meaning that unsolicited modifications of the
      file being sent can't break the integrity of the kernel. The worst
      that can happen is sending a corrupted TLS record, which is in any
      case not forbidden when using regular TCP sockets.
      
      Sockets other than TLS device offload ones are not affected by the new
      socket option. The actual status of zerocopy sendfile can be queried
      with sock_diag.
      
      Performance numbers in a single-core test with 24 HTTPS streams on
      nginx, under 100% CPU load:
      
      * non-zerocopy: 33.6 Gbit/s
      * zerocopy: 79.92 Gbit/s
      
      CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
      Signed-off-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  12. 13 May 2022, 1 commit
    • tls: Fix context leak on tls_device_down · 3740651b
      By Maxim Mikityanskiy
      The commit cited below claims to fix a use-after-free condition after
      tls_device_down. Apparently, the description wasn't fully accurate. The
      context stayed alive, but ctx->netdev became NULL, and the offload was
      torn down without a proper fallback, so a bug was present, but a
      different kind of bug.
      
      Due to a misunderstanding of the issue, the original patch dropped the
      refcount_dec_and_test line for the context to avoid the alleged
      premature deallocation. That line has to be restored, because it
      matches the refcount_inc_not_zero from the same function; otherwise,
      the contexts that survived tls_device_down are leaked.
      
      This patch fixes the described issue by restoring refcount_dec_and_test.
      After this change, there is no leak anymore, and the fallback to
      software kTLS still works; the restored pairing is sketched below.
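
      A rough sketch of the pairing being restored (abridged; not the exact
      code of tls_device_down()):

          /* for each offloaded context being torn down */
          if (!refcount_inc_not_zero(&ctx->refcount))
                  continue;       /* context already going away */

          /* ... tear down the offload, install the SW fallback ... */

          /* restored by this patch: drop the reference taken above,
           * freeing the context once the last reference is gone */
          if (refcount_dec_and_test(&ctx->refcount))
                  tls_device_free_ctx(ctx);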
      
      Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 28 Apr 2022, 1 commit
  14. 27 Apr 2022, 2 commits
    • net: tls: fix async vs NIC crypto offload · c706b2b5
      By Jakub Kicinski
      When the NIC takes care of crypto (or the record has already been
      decrypted) we forget to update darg->async. ->async is supposed to
      mean whether the record is async-capable on input and whether the
      record has been queued for async crypto on output.
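
      Conceptually, the fix is a one-line flag update on the early-exit path
      (a sketch; field names follow the TLS rx code, but the exact context
      in net/tls/tls_sw.c may differ):

          /* record already decrypted (by the NIC or a previous pass):
           * no async crypto was queued, so reflect that in the in/out
           * argument before returning early */
          if (tlm->decrypted) {
                  darg->zc = false;
                  darg->async = false;    /* the missing update */
                  return 0;
          }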
      Reported-by: Gal Pressman <gal@nvidia.com>
      Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument")
      Tested-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      By Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb frees
      outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, that solution is far
      from ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after the
      skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks up the skbs, they are still
      picked from the cpu on which the user thread is running.
      
      Ideally, the skbs (and associated page frags) should be freed on the
      cpu that originally allocated them.
      
      This patch removes the per-socket anchor (sk->defer_list) and instead
      uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(), after
      incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with no
      further action. In the (unlikely) case where the cpu does not run the
      net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast they
      should be freed.
      
      Note that we can add a small per-cpu cache in the future if we see any
      contention on sd->defer_lock; a sketch of the deferral path follows.
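
      Roughly, the deferral path looks like this (a simplified sketch
      following the field names in this patch; details such as the
      defer_count cap and irq-safe locking are omitted):

          void skb_attempt_defer_free(struct sk_buff *skb)
          {
                  int cpu = skb->alloc_cpu;   /* recorded at alloc time */
                  struct softnet_data *sd;

                  /* free locally if allocated here (or the cpu is gone) */
                  if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                          __kfree_skb(skb);
                          return;
                  }

                  sd = &per_cpu(softnet_data, cpu);
                  spin_lock_bh(&sd->defer_lock);
                  skb->next = sd->defer_list; /* single-linked list */
                  WRITE_ONCE(sd->defer_list, skb);
                  sd->defer_count++;
                  spin_unlock_bh(&sd->defer_lock);

                  /* make sure the remote cpu runs net_rx_action() soon;
                   * this raises NET_RX_SOFTIRQ via an IPI if needed */
                  if (sd->defer_count == 1)
                          smp_call_function_single_async(cpu, &sd->defer_csd);
          }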
      
      Tested on a pair of hosts with 100Gbit NICs, RFS enabled, and
      /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the page
      recycling strategy used by the NIC driver (its page pool capacity
      being too small compared to the number of skbs/pages held in socket
      receive queues).
      
      Note that this tuning was only done to demonstrate worse conditions
      for skb freeing for this particular test. These conditions can happen
      in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show a
      high cost for skb-freeing-related functions (marked with *):
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look
      better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 13 Apr 2022, 9 commits