- 18 7月, 2022 6 次提交
-
-
由 Jakub Kicinski 提交于
Callers always pass ctx->recv_pkt into decrypt_skb_update(), and it propagates it to its callees. This may give someone the false impression that those functions can accept any valid skb containing a TLS record. That's not the case, the record sequence number is read from the context, and they can only take the next record coming out of the strp. Let the functions get the skb from the context instead of passing it in. This will also make it cleaner to return a different skb than ctx->recv_pkt as the decrypted one later on. Since we're touching the definition of decrypt_skb_update() use this as an opportunity to rename it. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
I already forgot to transform darg from input to output semantics once on the NIC inline crypto fastpath. To avoid this happening again create a device equivalent of decrypt_internal(). A function responsible for decryption and transforming darg. While at it rename decrypt_internal() to a hopefully slightly more meaningful name. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We no longer allow a decrypted skb to remain linked to ctx->recv_pkt. Anything on the list is decrypted, anything on ctx->recv_pkt needs to be decrypted. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Detach the skb from ctx->recv_pkt after decryption is done, even if we can't consume it. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
I thought that having the skb either always on the ctx->rx_list or ctx->recv_pkt will simplify the handling, as we would not have to remember to flip it from one to the other on exit paths. This became a little harder to justify with the fix for BPF sockmaps. Subsequent changes will make the situation even worse. Queue the skbs only when really needed. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
recvmsg() in TLS gets data from the skb list (rx_list) or fresh skbs we read from TCP via strparser. The former holds skbs which were already decrypted for peek or decrypted and partially consumed. tls_wait_data() only notices appearance of fresh skbs coming out of TCP (or psock). It is possible, if there is a concurrent call to peek() and recv() that the peek() will move the data from input to rx_list without recv() noticing. recv() will then read data out of order or never wake up. This is not a practical use case/concern, but it makes the self tests less reliable. This patch solves the problem by allowing only one reader in. Because having multiple processes calling read()/peek() is not normal avoid adding a lock and try to fast-path the single reader case. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 7月, 2022 2 次提交
-
-
由 Jakub Kicinski 提交于
As discussed with Maxim add a counter for true NoPad violations. This should help deployments catch unexpected padded records vs just control records which always need re-encryption. https: //lore.kernel.org/all/b111828e6ac34baad9f4e783127eba8344ac252d.camel@nvidia.com/ Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
MIN -> MIB Fixes: 88527790 ("tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3") Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 09 7月, 2022 5 次提交
-
-
由 Jakub Kicinski 提交于
tls_wait_data() sets the return code as an output parameter and always returns ctx->recv_pkt on success. Return the error code directly and let the caller read the skb from the context. Use positive return code to indicate ctx->recv_pkt is ready. While touching the definition of the function rename it. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
include/net/tls.h is getting a little long, and is probably hard for driver authors to navigate. Split out the internals into a header which will live under net/tls/. While at it move some static inlines with a single user into the source files, add a few tls_ prefixes and fix spelling of 'proccess'. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
Jump to the free() call, instead of having to remember to free the memory in multiple places. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
The max size of iv + aad + tail is 22B. That's smaller than a single sg entry (32B). Don't bother with the memory packing, just create a struct which holds the max size of those members. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
AAD size is either 5 or 13. Really no point complicating the code for the 8B of difference. This will also let us turn the chunked up buffer into a sane struct. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 06 7月, 2022 5 次提交
-
-
由 Gal Pressman 提交于
This reverts commit 284b4d93. When using TLS device offload and coming from tls_device_reencrypt() flow, -EBADMSG error in tls_do_decryption() should not be counted towards the TLSTlsDecryptError counter. Move the counter increase back to the decrypt_internal() call site in decrypt_skb_update(). This also fixes an issue where: if (n_sgin < 1) return -EBADMSG; Errors in decrypt_internal() were not counted after the cited patch. Fixes: 284b4d93 ("tls: rx: move counting TlsDecryptErrors for sync") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: NMaxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: NTariq Toukan <tariqt@nvidia.com> Signed-off-by: NGal Pressman <gal@nvidia.com> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We continuously hold the socket lock during large reads and writes. This may inflate RTT and negatively impact TCP performance. Flush the backlog periodically. I tried to pick a flush period (128kB) which gives significant benefit but the max Bps rate is not yet visibly impacted. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Since optimisitic decrypt may add extra load in case of retries require socket owner to explicitly opt-in. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We currently don't support decrypt to user buffer with TLS 1.3 because we don't know the record type and how much padding record contains before decryption. In practice data records are by far most common and padding gets used rarely so we can assume data record, no padding, and if we find out that wasn't the case - retry the crypto in place (decrypt to skb). To safeguard from user overwriting content type and padding before we can check it attach a 1B sg entry where last byte of the record will land. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
To make future patches easier to review make data_len contain the length of the data, without the tail. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 5月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
Artem points out that skb may try to take over the skb and queue it to its own list. Unlink the skb before calling out. Fixes: b1a2c178 ("tls: rx: clear ctx->recv_pkt earlier") Reported-by: NArtem Savkov <asavkov@redhat.com> Tested-by: NArtem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 27 4月, 2022 2 次提交
-
-
由 Jakub Kicinski 提交于
When NIC takes care of crypto (or the record has already been decrypted) we forget to update darg->async. ->async is supposed to mean whether record is async capable on input and whether record has been queued for async crypto on output. Reported-by: NGal Pressman <gal@nvidia.com> Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument") Tested-by: NGal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
Logic added in commit f35f8219 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 13 4月, 2022 10 次提交
-
-
由 Jakub Kicinski 提交于
TLS 1.3 and ChaChaPoly don't carry IV in the packet. The code before this change would copy out iv_size worth of whatever followed the TLS header in the packet and then for TLS 1.3 | ChaCha overwrite that with the sequence number. Waste of cycles especially with TLS 1.2 being close to dead and TLS 1.3 being the common case. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
IVs are 8 or 16 bytes, no point reading out the exact value for quantities this small. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Propagating EINPROGRESS thru multiple layers of functions is error prone. Use darg->async as an in/out argument, like we use darg->zc today. On input it tells the code if async is allowed, on output if it took place. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
async crypto handler will report the socket error no need to report it again. We can, however, let the data we already copied be reported to user space but we need to make sure the error will be reported next time around. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
process_rx_list() only fails if it can't copy data to user space. There is no point recording the error onto sk->sk_err or giving up on the data which was read partially. Treat the return value like a normal socket partial read. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
If crypto didn't always invoke our callback for async we'd not be clearing skb->sk and would crash in the skb core when freeing it. This if must be dead code. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Async crypto never worked with TLS 1.3 and was explicitly disabled in commit 8497ded2 ("net/tls: Disable async decrytion for tls1.3"). There's no need for us to handle TLS 1.3 padding in the async cb. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Move counting TlsDecryptErrors to tls_do_decryption() where differences between sync and async crypto are reconciled. No functional changes, this code just always gave me a pause. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
The code is identical, we can save a few LoC. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
rx_list is protected by the socket lock, no need to take the built-in spin lock on accesses. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 4月, 2022 1 次提交
-
-
由 Oliver Hartkopp 提交于
The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg(). Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g. err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); or in err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition). Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.netSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
-
- 11 4月, 2022 8 次提交
-
-
由 Jakub Kicinski 提交于
The current invese logic is harder to follow (and adds extra tests to the fast path). We have to enumerate all cases which need to keep the skb before consuming it. It's simpler to jump out of the full record flow as we detect those cases. This makes it clear that partial consumption and peek can only reach end of the function thru the !zc case so move the code up there. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Whatever we do in the loop the skb should not remain on as ctx->recv_pkt afterwards. We can clear that pointer and restart strparser earlier. This adds overhead of extra linking and unlinking to rx_list but that's not large (upcoming change will switch to unlocked skb list operations). Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
tls_sw_advance_skb() always consumes the skb at the end of the loop. To fall here the following must be true: !async && !is_peek && !retain_skb retain_skb => !zc && rxm->full_len > len # but non-full record implies !zc, so above can be simplified as retain_skb => rxm->full_len > len !async && !is_peek && !(rxm->full_len > len) !async && !is_peek && rxm->full_len <= len tls_sw_advance_skb() returns false if len < rxm->full_len which can't be true given conditions above. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Most of the conditions deciding if zero-copy can be used do not change throughout the iterations, so pre-calculate them. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We track both if the last record was handled by async crypto and how many records were async. This is not necessary. We implicitly assume once crypto goes async it will stay that way, otherwise we'd reorder records. So just track if we're in async mode, the exact number of records is not necessary. This change also forces us into "async" mode more consistently in case crypto ever decided to interleave async and sync. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
tls_sw_advance_skb() caters to the async case when skb argument is NULL. In that case it simply unpauses the strparser. These are surprising semantics to a person reading the code, and result in higher LoC, so inline the __strp_unpause and only call tls_sw_advance_skb() when we actually move past an skb. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
cmsg can be filled in during rx_list processing or normal receive. Consolidate the code. We don't need to keep the boolean to track if the cmsg was created. 0 is an invalid content type. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Since we are protected from async completions by decrypt_compl_lock we can drop the async_notify and reinit the completion before we start waiting. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-