1. 18 Jul 2022, 6 commits
  2. 15 Jul 2022, 1 commit
  3. 12 Jul 2022, 3 commits
  4. 09 Jul 2022, 5 commits
  5. 06 Jul 2022, 5 commits
  6. 02 Jul 2022, 1 commit
  7. 23 Jun 2022, 2 commits
  8. 20 Jun 2022, 1 commit
    • net/tls: fix tls_sk_proto_close executed repeatedly · 69135c57
      By Ziyang Xuan
      After kTLS is set up on a socket, tls_update() updates ctx->sk_proto
      to sock->sk_prot, so ctx->sk_proto->close is now tls_sk_proto_close().
      When the socket is closed, tls_sk_proto_close() is called because
      sock->sk_prot->close is tls_sk_proto_close(). But ctx->sk_proto->close()
      is then invoked again from within tls_sk_proto_close(), so
      tls_sk_proto_close() ends up being executed repeatedly. That triggers
      the following bug.
      
      =================================================================
      KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
      RIP: 0010:tls_sk_proto_close+0xd8/0xaf0 net/tls/tls_main.c:306
      Call Trace:
       <TASK>
       tls_sk_proto_close+0x356/0xaf0 net/tls/tls_main.c:329
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
       __sock_release+0xcd/0x280 net/socket.c:650
       sock_close+0x18/0x20 net/socket.c:1365
      
      Updating a proto that is the same as sock->sk_prot is incorrect. Fix it
      by checking proto against sock->sk_prot for equality at the head of
      tls_update(), as sketched below.
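
      A minimal sketch of the guard (the exact layout of tls_update() in
      net/tls/tls_main.c may differ in details):

          static void tls_update(struct sock *sk, struct proto *p,
                                 void (*write_space)(struct sock *sk))
          {
                  /* p == sk->sk_prot means sk_prot already points at the
                   * TLS callbacks; updating ctx->sk_proto to it would make
                   * tls_sk_proto_close() re-enter itself on close().
                   */
                  if (p == sk->sk_prot)
                          return;
                  /* ... existing update logic ... */
          }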
      
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Reported-by: syzbot+29c3c12f3214b85ad081@syzkaller.appspotmail.com
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 10 Jun 2022, 1 commit
  10. 20 May 2022, 1 commit
  11. 19 May 2022, 1 commit
    • tls: Add opt-in zerocopy mode of sendfile() · c1318b39
      By Boris Pismenny
      TLS device offload copies sendfile data to a bounce buffer before
      transmitting. This keeps the MAC on TLS records valid even if the file
      contents change and part of a TLS record has to be retransmitted at
      the TCP level.
      
      In many common use cases (such as serving static files over HTTPS) the
      file contents are not changed on the fly. In such cases, breaking the
      connection when the file changes during transmission is entirely
      acceptable, because the data would be received corrupted anyway.
      
      This commit optimizes performance for such use cases by providing a
      new, optional mode of TLS sendfile() in which the extra copy is
      skipped. Removing this copy improves performance significantly: TLS
      and TCP sendfile perform the same operations, and the only overhead is
      TLS header/trailer insertion.
      
      The new mode can only be enabled with the new socket option,
      TLS_TX_ZEROCOPY_SENDFILE, on a per-socket basis. It preserves
      backwards compatibility with existing applications that rely on the
      copying behavior; an illustrative sketch follows.
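
      As an illustration, enabling the mode from userspace might look like
      this (a minimal sketch; it assumes a connected TCP socket that already
      has the TLS ULP and TX crypto state configured, and a kernel carrying
      this patch, which introduced the TLS_TX_ZEROCOPY_SENDFILE name):

          #include <sys/socket.h>
          #include <linux/tls.h>

          #ifndef SOL_TLS
          #define SOL_TLS 282     /* value from the kernel's socket.h */
          #endif

          /* fd: TCP socket with kTLS TX already set up */
          static int enable_zc_sendfile(int fd)
          {
                  int one = 1;

                  /* opt in: skip the bounce-buffer copy on sendfile() */
                  return setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_SENDFILE,
                                    &one, sizeof(one));
          }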
      
      The new mode is safe, meaning that unsolicited modifications of the
      file being sent can't break the integrity of the kernel. The worst
      that can happen is sending a corrupted TLS record, which is in any
      case not forbidden when using regular TCP sockets.
      
      Sockets other than TLS device offload ones are not affected by the new
      socket option. The actual status of zerocopy sendfile can be queried
      with sock_diag.
      
      Performance numbers in a single-core test with 24 HTTPS streams on
      nginx, under 100% CPU load:
      
      * non-zerocopy: 33.6 Gbit/s
      * zerocopy: 79.92 Gbit/s
      
      CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
      Signed-off-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  12. 13 May 2022, 1 commit
    • tls: Fix context leak on tls_device_down · 3740651b
      By Maxim Mikityanskiy
      The commit cited below claims to fix a use-after-free condition after
      tls_device_down. Apparently, the description wasn't fully accurate. The
      context stayed alive, but ctx->netdev became NULL, and the offload was
      torn down without a proper fallback, so a bug was present, but a
      different kind of bug.
      
      Due to a misunderstanding of the issue, the original patch dropped the
      refcount_dec_and_test line for the context to avoid the alleged
      premature deallocation. That line has to be restored, because it
      matches the refcount_inc_not_zero from the same function; otherwise,
      the contexts that survived tls_device_down are leaked.
      
      This patch fixes the described issue by restoring refcount_dec_and_test.
      After this change, there is no leak anymore, and the fallback to
      software kTLS still works; the restored pairing is sketched below.
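
      A rough sketch of the pairing being restored (abridged; not the exact
      code of tls_device_down()):

          /* for each offloaded context being torn down */
          if (!refcount_inc_not_zero(&ctx->refcount))
                  continue;       /* context already going away */

          /* ... tear down the offload, install the SW fallback ... */

          /* restored by this patch: drop the reference taken above,
           * freeing the context once the last reference is gone */
          if (refcount_dec_and_test(&ctx->refcount))
                  tls_device_free_ctx(ctx);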
      
      Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 28 Apr 2022, 1 commit
  14. 27 Apr 2022, 2 commits
    • net: tls: fix async vs NIC crypto offload · c706b2b5
      By Jakub Kicinski
      When the NIC takes care of crypto (or the record has already been
      decrypted) we forget to update darg->async. ->async is supposed to
      mean whether the record is async-capable on input and whether the
      record has been queued for async crypto on output.
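
      Conceptually, the fix is a one-line flag update on the early-exit path
      (a sketch; field names follow the TLS rx code, but the exact context
      in net/tls/tls_sw.c may differ):

          /* record already decrypted (by the NIC or a previous pass):
           * no async crypto was queued, so reflect that in the in/out
           * argument before returning early */
          if (tlm->decrypted) {
                  darg->zc = false;
                  darg->async = false;    /* the missing update */
                  return 0;
          }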
      Reported-by: Gal Pressman <gal@nvidia.com>
      Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument")
      Tested-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      By Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb frees
      outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, that solution is far
      from ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after the
      skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks up the skbs, they are still
      picked from the cpu on which the user thread is running.
      
      Ideally, the skbs (and associated page frags) should be freed on the
      cpu that originally allocated them.
      
      This patch removes the per-socket anchor (sk->defer_list) and instead
      uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(), after
      incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with no
      further action. In the (unlikely) case where the cpu does not run the
      net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast they
      should be freed.
      
      Note that we can add a small per-cpu cache in the future if we see any
      contention on sd->defer_lock; a sketch of the deferral path follows.
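
      Roughly, the deferral path looks like this (a simplified sketch
      following the field names in this patch; details such as the
      defer_count cap and irq-safe locking are omitted):

          void skb_attempt_defer_free(struct sk_buff *skb)
          {
                  int cpu = skb->alloc_cpu;   /* recorded at alloc time */
                  struct softnet_data *sd;

                  /* free locally if allocated here (or the cpu is gone) */
                  if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                          __kfree_skb(skb);
                          return;
                  }

                  sd = &per_cpu(softnet_data, cpu);
                  spin_lock_bh(&sd->defer_lock);
                  skb->next = sd->defer_list; /* single-linked list */
                  WRITE_ONCE(sd->defer_list, skb);
                  sd->defer_count++;
                  spin_unlock_bh(&sd->defer_lock);

                  /* make sure the remote cpu runs net_rx_action() soon;
                   * this raises NET_RX_SOFTIRQ via an IPI if needed */
                  if (sd->defer_count == 1)
                          smp_call_function_single_async(cpu, &sd->defer_csd);
          }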
      
      Tested on a pair of hosts with 100Gbit NICs, RFS enabled, and
      /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the page
      recycling strategy used by the NIC driver (its page pool capacity
      being too small compared to the number of skbs/pages held in socket
      receive queues).
      
      Note that this tuning was only done to demonstrate worse conditions
      for skb freeing for this particular test. These conditions can happen
      in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show a
      high cost for skb-freeing-related functions (marked with *):
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look
      better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 13 Apr 2022, 9 commits