- 27 7月, 2022 5 次提交
-
-
由 Jakub Kicinski 提交于
TLS is a relatively poor fit for strparser. We pause the input every time a message is received, wait for a read which will decrypt the message, start the parser, repeat. strparser is built to delineate the messages, wrap them in individual skbs and let them float off into the stack or a different socket. TLS wants the data pages and nothing else. There's no need for TLS to keep cloning (and occasionally skb_unclone()'ing) the TCP rx queue. This patch uses a pre-allocated skb and attaches the skbs from the TCP rx queue to it as frags. TLS is careful never to modify the input skb without CoW'ing / detaching it first. Since we call TCP rx queue cleanup directly we also get back the benefit of skb deferred free. Overall this results in a 6% gain in my benchmarks. Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
The non-zero-copy path assumes a full skb with decrypted contents. This means the device offload would have to CoW the data. Try to keep the zero-copy status instead, copy the data to user space. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
In the future we'll want to reuse the input skb in case of zero-copy so we shouldn't always free darg.skb. Move the freeing of darg.skb into the non-zc cases. All cases will now free ctx->recv_pkt (inside let tls_rx_rec_done()). Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
After recent changes the SW side of tls_rx_one_record() can be nicely encapsulated in its own function. Move the pad handling as well. This will be useful for ->zc handling in tls_decrypt_device(). Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
To allow for the logic to change later wrap accesses which interrogate the input skb in helper functions. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 22 7月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
Eric reports we should release the socket lock if the entire "grab reader lock" operation has failed. The callers assume they don't have to release it or otherwise unwind. Reported-by: NEric Dumazet <edumazet@google.com> Reported-by: syzbot+16e72110feb2b653ef27@syzkaller.appspotmail.com Fixes: 4cbc325e ("tls: rx: allow only one reader at a time") Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220720203701.2179034-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 18 7月, 2022 11 次提交
-
-
由 Jakub Kicinski 提交于
We currently CoW Rx skbs whenever we can't decrypt to a user space buffer. The skbs can be enormous (64kB) and CoW does a linear alloc which has a strong chance of failing under memory pressure. Or even without, skb_cow_data() assumes GFP_ATOMIC. Allocate a new frag'd skb and decrypt into it. We finally take advantage of the decrypted skb getting returned via darg. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
The "zero-copy" path in SW TLS will engage either for no skbs or for all but last. If the recvmsg parameters are right and the socket can do ZC we'll ZC until the iterator can't fit a full record at which point we'll decrypt one more record and copy over the necessary bits to fill up the request. The only reason we hold onto the ZC skbs which went thru the async path until the end of recvmsg() is to count bytes. We need an accurate count of zc'ed bytes so that we can calculate how much of the non-zc'd data to copy. To allow freeing input skbs on the ZC path count only how much of the list we'll need to consume. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Async crypto currently benefits from the fact that we decrypt in place. When we allow input and output to be different skbs we will have to hang onto the input while we move to the next record. Clone the inputs and keep them on a list. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Async crypto TLS Rx currently waits for crypto to be done in order to strip the TLS header and tailer. Simplify the code by moving the pointers immediately, since only TLS 1.2 is supported here there is no message padding. This simplifies the decryption into a new skb in the next patch as we don't have to worry about input vs output skb in the decrypt_done() handler any more. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Instead of using ctx->recv_pkt after decryption read the skb from darg.skb. This moves the decision of what the "output skb" is to the decrypt handlers. For now after decrypt handler returns successfully ctx->recv_pkt is simply moved to darg.skb, but it will change soon. Note that tls_decrypt_sg() cannot clear the ctx->recv_pkt because it gets called to re-encrypt (i.e. by the device offload). So we need an awkward temporary if() in tls_rx_one_record(). Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Callers always pass ctx->recv_pkt into decrypt_skb_update(), and it propagates it to its callees. This may give someone the false impression that those functions can accept any valid skb containing a TLS record. That's not the case, the record sequence number is read from the context, and they can only take the next record coming out of the strp. Let the functions get the skb from the context instead of passing it in. This will also make it cleaner to return a different skb than ctx->recv_pkt as the decrypted one later on. Since we're touching the definition of decrypt_skb_update() use this as an opportunity to rename it. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
I already forgot to transform darg from input to output semantics once on the NIC inline crypto fastpath. To avoid this happening again create a device equivalent of decrypt_internal(). A function responsible for decryption and transforming darg. While at it rename decrypt_internal() to a hopefully slightly more meaningful name. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We no longer allow a decrypted skb to remain linked to ctx->recv_pkt. Anything on the list is decrypted, anything on ctx->recv_pkt needs to be decrypted. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Detach the skb from ctx->recv_pkt after decryption is done, even if we can't consume it. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
I thought that having the skb either always on the ctx->rx_list or ctx->recv_pkt will simplify the handling, as we would not have to remember to flip it from one to the other on exit paths. This became a little harder to justify with the fix for BPF sockmaps. Subsequent changes will make the situation even worse. Queue the skbs only when really needed. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
recvmsg() in TLS gets data from the skb list (rx_list) or fresh skbs we read from TCP via strparser. The former holds skbs which were already decrypted for peek or decrypted and partially consumed. tls_wait_data() only notices appearance of fresh skbs coming out of TCP (or psock). It is possible, if there is a concurrent call to peek() and recv() that the peek() will move the data from input to rx_list without recv() noticing. recv() will then read data out of order or never wake up. This is not a practical use case/concern, but it makes the self tests less reliable. This patch solves the problem by allowing only one reader in. Because having multiple processes calling read()/peek() is not normal avoid adding a lock and try to fast-path the single reader case. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 7月, 2022 2 次提交
-
-
由 Jakub Kicinski 提交于
As discussed with Maxim add a counter for true NoPad violations. This should help deployments catch unexpected padded records vs just control records which always need re-encryption. https: //lore.kernel.org/all/b111828e6ac34baad9f4e783127eba8344ac252d.camel@nvidia.com/ Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
MIN -> MIB Fixes: 88527790 ("tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3") Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 09 7月, 2022 5 次提交
-
-
由 Jakub Kicinski 提交于
tls_wait_data() sets the return code as an output parameter and always returns ctx->recv_pkt on success. Return the error code directly and let the caller read the skb from the context. Use positive return code to indicate ctx->recv_pkt is ready. While touching the definition of the function rename it. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
include/net/tls.h is getting a little long, and is probably hard for driver authors to navigate. Split out the internals into a header which will live under net/tls/. While at it move some static inlines with a single user into the source files, add a few tls_ prefixes and fix spelling of 'proccess'. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
Jump to the free() call, instead of having to remember to free the memory in multiple places. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
The max size of iv + aad + tail is 22B. That's smaller than a single sg entry (32B). Don't bother with the memory packing, just create a struct which holds the max size of those members. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
AAD size is either 5 or 13. Really no point complicating the code for the 8B of difference. This will also let us turn the chunked up buffer into a sane struct. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 06 7月, 2022 5 次提交
-
-
由 Gal Pressman 提交于
This reverts commit 284b4d93. When using TLS device offload and coming from tls_device_reencrypt() flow, -EBADMSG error in tls_do_decryption() should not be counted towards the TLSTlsDecryptError counter. Move the counter increase back to the decrypt_internal() call site in decrypt_skb_update(). This also fixes an issue where: if (n_sgin < 1) return -EBADMSG; Errors in decrypt_internal() were not counted after the cited patch. Fixes: 284b4d93 ("tls: rx: move counting TlsDecryptErrors for sync") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: NMaxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: NTariq Toukan <tariqt@nvidia.com> Signed-off-by: NGal Pressman <gal@nvidia.com> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We continuously hold the socket lock during large reads and writes. This may inflate RTT and negatively impact TCP performance. Flush the backlog periodically. I tried to pick a flush period (128kB) which gives significant benefit but the max Bps rate is not yet visibly impacted. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Since optimisitic decrypt may add extra load in case of retries require socket owner to explicitly opt-in. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
We currently don't support decrypt to user buffer with TLS 1.3 because we don't know the record type and how much padding record contains before decryption. In practice data records are by far most common and padding gets used rarely so we can assume data record, no padding, and if we find out that wasn't the case - retry the crypto in place (decrypt to skb). To safeguard from user overwriting content type and padding before we can check it attach a 1B sg entry where last byte of the record will land. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
To make future patches easier to review make data_len contain the length of the data, without the tail. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 5月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
Artem points out that skb may try to take over the skb and queue it to its own list. Unlink the skb before calling out. Fixes: b1a2c178 ("tls: rx: clear ctx->recv_pkt earlier") Reported-by: NArtem Savkov <asavkov@redhat.com> Tested-by: NArtem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 27 4月, 2022 2 次提交
-
-
由 Jakub Kicinski 提交于
When NIC takes care of crypto (or the record has already been decrypted) we forget to update darg->async. ->async is supposed to mean whether record is async capable on input and whether record has been queued for async crypto on output. Reported-by: NGal Pressman <gal@nvidia.com> Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument") Tested-by: NGal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
Logic added in commit f35f8219 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 13 4月, 2022 8 次提交
-
-
由 Jakub Kicinski 提交于
TLS 1.3 and ChaChaPoly don't carry IV in the packet. The code before this change would copy out iv_size worth of whatever followed the TLS header in the packet and then for TLS 1.3 | ChaCha overwrite that with the sequence number. Waste of cycles especially with TLS 1.2 being close to dead and TLS 1.3 being the common case. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
IVs are 8 or 16 bytes, no point reading out the exact value for quantities this small. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Propagating EINPROGRESS thru multiple layers of functions is error prone. Use darg->async as an in/out argument, like we use darg->zc today. On input it tells the code if async is allowed, on output if it took place. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
async crypto handler will report the socket error no need to report it again. We can, however, let the data we already copied be reported to user space but we need to make sure the error will be reported next time around. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
process_rx_list() only fails if it can't copy data to user space. There is no point recording the error onto sk->sk_err or giving up on the data which was read partially. Treat the return value like a normal socket partial read. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
If crypto didn't always invoke our callback for async we'd not be clearing skb->sk and would crash in the skb core when freeing it. This if must be dead code. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Async crypto never worked with TLS 1.3 and was explicitly disabled in commit 8497ded2 ("net/tls: Disable async decrytion for tls1.3"). There's no need for us to handle TLS 1.3 padding in the async cb. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Move counting TlsDecryptErrors to tls_do_decryption() where differences between sync and async crypto are reconciled. No functional changes, this code just always gave me a pause. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-