1. 02 Jun, 2020 1 commit
    • bpf: Fix running sk_skb program types with ktls · e91de6af
      John Fastabend committed
      KTLS uses a stream parser to collect TLS messages and send them to
      the upper layer tls receive handler. This ensures the tls receiver
      has a full TLS header to parse when it is run. However, when a
      socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
      is enabled we end up with two stream parsers running on the same
      socket.
      
      The result is that both try to run on the same socket. First the KTLS
      stream parser runs and calls read_sock(), which calls tcp_read_sock(),
      which in turn calls tcp_recv_skb(). This dequeues the skb from the
      sk_receive_queue. When this is done, the KTLS code calls the
      data_ready() callback, which, because we stacked KTLS on top of the
      BPF stream verdict program, has been replaced with
      sk_psock_start_strp(). This in turn kicks the stream parser again,
      which eventually does the same thing KTLS did above: it calls into
      tcp_recv_skb() and dequeues an skb from the sk_receive_queue.
      
      At this point the data stream is broken. Part of the stream was
      handled by the KTLS side, and some other bytes may have been handled
      by the BPF side. Generally this results in either missing data
      or, more likely, a "Bad Message" complaint from the kTLS receive
      handler, as the BPF program steals some bytes meant to be in a
      TLS header and/or the TLS header length is no longer correct.
      
      We've already broken the idealized model where we can stack ULPs
      in any order with generic callbacks on the TX side to handle this.
      So in this patch we do the same thing for the RX side. We add
      a sk_psock_strp_enabled() helper so TLS can learn that a BPF verdict
      program is running, and a tls_sw_has_ctx_rx() helper so the BPF
      side can learn that there is a TLS ULP on the socket.
      
      Then on the BPF side we omit calling our stream parser to avoid
      breaking the data stream for the KTLS receiver. On the
      KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
      receiver is done with the packet but before it posts the
      msg to userspace. This gives us symmetry between the TX and
      RX halves and IMO makes it usable again. On the TX side we
      process packets in the order BPF -> TLS -> TCP, and on
      the receive side in the reverse order, TCP -> TLS -> BPF.
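
The ordering rule can be sketched as a small userspace C model. This is only an illustrative toy, not kernel code: the `toy_*` functions and the stage enum are hypothetical names, and only the ordering itself is taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stage labels for this model; these are not kernel symbols. */
enum toy_stage { TOY_TCP, TOY_TLS, TOY_BPF };

/*
 * Fill out[] with the RX stages a segment passes through and return the
 * count. When a TLS RX context is present, the BPF verdict program runs
 * only after TLS is done with the record: TCP -> TLS -> BPF.
 */
static size_t toy_rx_order(bool has_tls_rx, bool has_bpf_verdict,
                           enum toy_stage out[3])
{
    size_t n = 0;

    assert(out != NULL);
    out[n++] = TOY_TCP;
    if (has_tls_rx)
        out[n++] = TOY_TLS;
    if (has_bpf_verdict)
        out[n++] = TOY_BPF;
    return n;
}

/* TX is the mirror image: BPF -> TLS -> TCP. */
static size_t toy_tx_order(bool has_tls_tx, bool has_bpf_tx,
                           enum toy_stage out[3])
{
    size_t n = 0;

    assert(out != NULL);
    if (has_bpf_tx)
        out[n++] = TOY_BPF;
    if (has_tls_tx)
        out[n++] = TOY_TLS;
    out[n++] = TOY_TCP;
    return n;
}
```

With both layers attached, `toy_rx_order()` yields TCP, TLS, BPF and `toy_tx_order()` yields the reverse.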
      
      Discovered while testing OpenSSL 3.0 Alpha2.0 release.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 28 May, 2020 1 commit
  3. 26 May, 2020 1 commit
    • net/tls: fix race condition causing kernel panic · 0cada332
      Vinay Kumar Yadav committed
      tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
      // tls_sw_recvmsg()
      	if (atomic_read(&ctx->decrypt_pending))
      		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
      	else
      		reinit_completion(&ctx->async_wait.completion);
      
      // tls_decrypt_done()
      	pending = atomic_dec_return(&ctx->decrypt_pending);
      
      	if (!pending && READ_ONCE(ctx->async_notify))
      		complete(&ctx->async_wait.completion);
      
      Consider the scenario where tls_decrypt_done() is about to run complete()
      
      	if (!pending && READ_ONCE(ctx->async_notify))
      
      and tls_sw_recvmsg() reads decrypt_pending == 0 and calls
      reinit_completion() before tls_decrypt_done() runs complete(). This
      sequence leaves the completion wrongly signaled. Consequently, the next
      decrypt request will not wait for completion, and eventually, on
      connection close, the crypto resources are freed while there is no way
      to handle the pending decrypt response.
      
      This race condition can be avoided by making atomic_read() mutually
      exclusive with atomic_dec_return() and complete(). Introduce a spin
      lock to ensure the mutual exclusion.
      
      Addressed a similar problem in the TX direction.
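
The locking scheme can be sketched in userspace C as follows. This is a minimal model under stated assumptions, not the patch itself: the `toy_*` names, the C11 `atomic_flag` spin lock, and the boolean standing in for the kernel completion are all illustrative. The point it shows is that the decrement-plus-complete step and the read of the pending count happen under the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock built on a C11 atomic_flag (stand-in for a kernel spinlock). */
struct toy_lock { atomic_flag held; };

static void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct toy_rx_ctx {
    struct toy_lock compl_lock; /* hypothetical name for the added lock */
    int decrypt_pending;        /* outstanding async decrypt requests */
    bool async_notify;          /* receiver wants to be notified */
    bool completion_done;       /* stand-in for the completion state */
};

static void toy_ctx_init(struct toy_rx_ctx *ctx, int pending)
{
    atomic_flag_clear(&ctx->compl_lock.held);
    ctx->decrypt_pending = pending;
    ctx->async_notify = false;
    ctx->completion_done = false;
}

/* Completion side: decrement and complete as one critical section. */
static void toy_decrypt_done(struct toy_rx_ctx *ctx)
{
    toy_lock_acquire(&ctx->compl_lock);
    assert(ctx->decrypt_pending > 0);
    ctx->decrypt_pending--;
    if (ctx->decrypt_pending == 0 && ctx->async_notify)
        ctx->completion_done = true;
    toy_lock_release(&ctx->compl_lock);
}

/*
 * Receive side: sample the counter under the same lock. Returns true when
 * the caller has to wait for the completion; otherwise re-arms it.
 */
static bool toy_recv_must_wait(struct toy_rx_ctx *ctx)
{
    bool pending;

    toy_lock_acquire(&ctx->compl_lock);
    ctx->async_notify = true;
    pending = ctx->decrypt_pending != 0;
    toy_lock_release(&ctx->compl_lock);

    if (!pending)
        ctx->completion_done = false; /* models reinit_completion() */
    return pending;
}
```

Because the completion side holds the lock across both the decrement and the signal, the receive side can no longer observe a zero count and re-arm the completion in between those two steps.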
      
      v1->v2:
      - More readable commit message.
      - Corrected the lock to fix new race scenario.
      - Removed barrier which is not needed now.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 May, 2020 2 commits
  5. 28 Apr, 2020 2 commits
    • net/tls: Fix sk_psock refcnt leak when in tls_data_ready() · 62b4011f
      Xiyu Yang committed
      tls_data_ready() invokes sk_psock_get(), which returns a reference of
      the specified sk_psock object to "psock" with increased refcnt.
      
      When tls_data_ready() returns, local variable "psock" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in one exception handling path of
      tls_data_ready(). When "psock->ingress_msg" is empty but "psock" is not
      NULL, the function forgets to decrease the refcnt increased by
      sk_psock_get(), causing a refcnt leak.
      
      Fix this issue by calling sk_psock_put() on all paths when "psock" is
      not NULL.
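
The balanced get/put pattern can be sketched in userspace C. This is a toy model, assuming a plain integer refcount and hypothetical `toy_*` names; it only shows that the put happens on every path where the get returned a non-NULL object, including the one where the ingress queue is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy psock: a refcount plus the length of its ingress message queue. */
struct toy_psock {
    int refcnt;
    int ingress_len;
};

/* Take a reference; returns NULL when there is no psock to reference. */
static struct toy_psock *toy_psock_get(struct toy_psock *p)
{
    if (p)
        p->refcnt++;
    return p;
}

static void toy_psock_put(struct toy_psock *p)
{
    assert(p != NULL && p->refcnt > 0);
    p->refcnt--;
}

/* Returns true when the reader should be woken up. */
static bool toy_data_ready(struct toy_psock *candidate, bool rx_ready)
{
    struct toy_psock *psock = toy_psock_get(candidate);
    bool wake = rx_ready;

    if (psock) {
        if (psock->ingress_len > 0)
            wake = true;
        /* Dropped whether or not the ingress queue was empty. */
        toy_psock_put(psock);
    }
    return wake;
}
```

After any call to `toy_data_ready()`, the refcount equals its value before the call.
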
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tls: Fix sk_psock refcnt leak in bpf_exec_tx_verdict() · 095f5614
      Xiyu Yang committed
      bpf_exec_tx_verdict() invokes sk_psock_get(), which returns a reference
      of the specified sk_psock object to "psock" with increased refcnt.
      
      When bpf_exec_tx_verdict() returns, local variable "psock" becomes
      invalid, so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in one exception handling path of
      bpf_exec_tx_verdict(). When "policy" equals NULL but "psock" is not
      NULL, the function forgets to decrease the refcnt increased by
      sk_psock_get(), causing a refcnt leak.
      
      Fix this issue by calling sk_psock_put() on this error path before
      bpf_exec_tx_verdict() returns.
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 Apr, 2020 1 commit
    • net: tls: Avoid assigning 'const' pointer to non-const pointer · 9a893949
      Will Deacon committed
      tls_build_proto() uses WRITE_ONCE() to assign a 'const' pointer to a
      'non-const' pointer. Cleanups to the implementation of WRITE_ONCE() mean
      that this will give rise to a compiler warning, just like a plain old
      assignment would do:
      
        | net/tls/tls_main.c: In function ‘tls_build_proto’:
        | ./include/linux/compiler.h:229:30: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
        | net/tls/tls_main.c:640:4: note: in expansion of macro ‘smp_store_release’
        |   640 |    smp_store_release(&saved_tcpv6_prot, prot);
        |       |    ^~~~~~~~~~~~~~~~~
      
      Drop the const qualifier from the local 'prot' variable, as it isn't
      needed.
      
      Cc: Boris Pismenny <borisp@mellanox.com>
      Cc: Aviad Yehezkel <aviadye@mellanox.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Will Deacon <will@kernel.org>
  7. 09 Apr, 2020 1 commit
    • net/tls: fix const assignment warning · f691a25c
      Arnd Bergmann committed
      Building with some experimental patches, I came across a warning
      in the tls code:
      
      include/linux/compiler.h:215:30: warning: assignment discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
        215 |  *(volatile typeof(x) *)&(x) = (val);  \
            |                              ^
      net/tls/tls_main.c:650:4: note: in expansion of macro 'smp_store_release'
        650 |    smp_store_release(&saved_tcpv4_prot, prot);
      
      This appears to be a legitimate warning about assigning a const pointer
      into the non-const 'saved_tcpv4_prot' global. Annotate both the ipv4 and
      ipv6 pointers 'const' to make the code internally consistent.
      
      Fixes: 5bb4c45d ("net/tls: Read sk_prot once when building tls proto ops")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 Mar, 2020 3 commits
  9. 22 Feb, 2020 1 commit
    • net, sk_msg: Annotate lockless access to sk_prot on clone · b8e202d1
      Jakub Sitnicki committed
      sk_msg and ULP frameworks override protocol callbacks pointer in
      sk->sk_prot, while tcp accesses it locklessly when cloning the listening
      socket, that is with neither sk_lock nor sk_callback_lock held.
      
      Once we enable use of listening sockets with sockmap (and hence sk_msg),
      there will be shared access to sk->sk_prot if socket is getting cloned
      while being inserted/deleted to/from the sockmap from another CPU:
      
      Read side:
      
      tcp_v4_rcv
        sk = __inet_lookup_skb(...)
        tcp_check_req(sk)
          inet_csk(sk)->icsk_af_ops->syn_recv_sock
            tcp_v4_syn_recv_sock
              tcp_create_openreq_child
                inet_csk_clone_lock
                  sk_clone_lock
                    READ_ONCE(sk->sk_prot)
      
      Write side:
      
      sock_map_ops->map_update_elem
        sock_map_update_elem
          sock_map_update_common
            sock_map_link_no_progs
              tcp_bpf_init
                tcp_bpf_update_sk_prot
                  sk_psock_update_proto
                    WRITE_ONCE(sk->sk_prot, ops)
      
      sock_map_ops->map_delete_elem
        sock_map_delete_elem
          __sock_map_delete
           sock_map_unref
             sk_psock_put
               sk_psock_drop
                 sk_psock_restore_proto
                   tcp_update_ulp
                     WRITE_ONCE(sk->sk_prot, proto)
      
      Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
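
The annotation pattern can be sketched in userspace C. This is an approximation under stated assumptions: `TOY_READ_ONCE`/`TOY_WRITE_ONCE` are simplified volatile-cast macros (using the GCC/Clang `__typeof__` extension), not the kernel definitions, and the `toy_*` types are hypothetical. The reader loads the pointer exactly once into a local and uses only that local afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force a single access through a volatile lvalue. */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_proto { const char *name; };
struct toy_sock  { const struct toy_proto *sk_prot; };

/* Write side: swap the protocol callbacks pointer with one marked store. */
static void toy_update_proto(struct toy_sock *sk, const struct toy_proto *ops)
{
    TOY_WRITE_ONCE(sk->sk_prot, ops);
}

/* Read side: sample the pointer once, then work only with the local copy. */
static void toy_clone(struct toy_sock *parent, struct toy_sock *child)
{
    const struct toy_proto *prot = TOY_READ_ONCE(parent->sk_prot);

    assert(prot != NULL);
    child->sk_prot = prot;
}
```

The volatile casts only stop the compiler from tearing, fusing, or repeating the access; they do not add ordering between CPUs.
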
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
  10. 20 Feb, 2020 1 commit
    • net/tls: Fix to avoid gettig invalid tls record · 06f5201c
      Rohit Maheshwari committed
      The current code doesn't check whether the TCP sequence number starts
      at (or after) the 1st record's start sequence number. It only checks
      whether the seq number is before the 1st record's end sequence number.
      This problem is always a possibility in the retransmit case. If the
      record which belongs to a requested seq number has already been deleted,
      tls_get_record will start looking into the list and, as per the check,
      will test whether the seq number is before the end seq of the 1st
      record. That is always true, so the 1st record is always returned, when
      it should in fact return NULL.
      As part of the fix, start looking at each record only if the sequence
      number lies within the list; otherwise return NULL.
      One more check is added: the driver looks for the start marker record to
      handle TCP packets which are before the TLS offload start sequence
      number, hence return the 1st record if the record is the TLS start
      marker and the seq number is before the 1st record's starting sequence
      number.
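
The lookup rule can be sketched in userspace C. This is a simplified model, assuming the records sit in a sorted array rather than a kernel list; the `toy_*` names and the struct layout are hypothetical. It returns NULL for sequence numbers outside the list, except that a leading start marker record still claims earlier sequence numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_record {
    uint32_t start_seq;  /* first sequence number covered */
    uint32_t end_seq;    /* one past the last sequence number covered */
    bool start_marker;   /* true for the offload start marker record */
};

/* Wraparound-safe "a is before b" for 32-bit sequence numbers. */
static bool toy_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static const struct toy_record *
toy_get_record(const struct toy_record *recs, size_t n, uint32_t seq)
{
    assert(recs != NULL || n == 0);
    if (n == 0)
        return NULL;

    /* Before the list: only a start marker may claim the sequence number. */
    if (toy_before(seq, recs[0].start_seq))
        return recs[0].start_marker ? &recs[0] : NULL;

    /* At or past the end of the last record: not in the list. */
    if (!toy_before(seq, recs[n - 1].end_seq))
        return NULL;

    /* Inside the list: the first record whose end lies beyond seq. */
    for (size_t i = 0; i < n; i++) {
        if (toy_before(seq, recs[i].end_seq))
            return &recs[i];
    }
    return NULL;
}
```

Without the first check, any sequence number below the first record's end would match the first record, which is the behavior the message describes as the bug.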
      
      Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure")
      Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Jan, 2020 4 commits
    • bpf: Sockmap/tls, fix pop data with SK_DROP return code · 7361d448
      John Fastabend committed
      When user returns SK_DROP we need to reset the number of copied bytes
      to indicate to the user the bytes were dropped and not sent. If we
      don't reset the copied arg sendmsg will return as if those bytes were
      copied giving the user a positive return value.
      
      This works as expected today except in the case where the user also
      pops bytes. In the pop case the sg.size is reduced but we don't correctly
      account for this when copied bytes is reset. The popped bytes are not
      accounted for and we return a small positive value potentially confusing
      the user.
      
      The reason this happens is a typo where we do the wrong comparison
      when accounting for pop bytes. In this fix, notice the if/else is not
      needed and that we have a similar problem if we push data, except it's
      not visible to the user because if delta is larger than the sg.size we
      return a negative value, so it appears as an error regardless.
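
The accounting can be illustrated with a small arithmetic sketch in C. This is a toy model, not the kernel code: the function names and the idea of passing the sizes in directly are assumptions made for illustration. Here delta is the size before the BPF program ran minus the size after, so it is positive after a pop and negative after a push.

```c
#include <assert.h>

/*
 * size_before: message size when the verdict program was entered.
 * size_after:  message size after the program pushed or popped bytes.
 * Returns the "copied" value reported after a drop.
 */
static long toy_copied_after_drop(long copied, long size_before,
                                  long size_after)
{
    long delta = size_before - size_after; /* > 0 after pop, < 0 after push */
    long freed = size_after;               /* bytes released by the drop */

    assert(size_before >= 0 && size_after >= 0);
    return copied - (freed + delta);
}

/* The same computation without the delta term, to show the symptom. */
static long toy_copied_ignoring_delta(long copied, long size_after)
{
    return copied - size_after;
}
```

For a 100-byte send where the program pops 10 bytes and then drops, the first function returns 0 while the second returns 10, the small positive value described above. With a 10-byte push the second returns -10.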
      
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
    • bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining · 9aaaa568
      John Fastabend committed
      It's possible through a set of push, pop, apply helper calls to construct
      a skmsg, which is just a ring of scatterlist elements, with the start
      value larger than the end value. For example,
      
            end       start
        |_0_|_1_| ... |_n_|_n+1_|
      
      Where end points at 1 and start points at n, so that the valid elements
      are the set {n, n+1, 0, 1}.
      
      Currently, because we don't build the correct chain only {n, n+1} will
      be sent. This adds a check and sg_chain call to correctly submit the
      above to the crypto and tls send path.
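
The wrapped traversal can be sketched in userspace C. This is an index-only model with hypothetical names (`toy_collect`, `TOY_RING_SIZE`), and it treats `end` as inclusive to match the description above. It visits the tail segment first and then continues at the head, which is the order the extra chaining is meant to produce.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 6 /* slots 0..5, so n = 4 and n+1 = 5 in the picture */

/*
 * Write the ring indices to submit, in order, into out[] and return how
 * many there are. start and end are both inclusive slot indices.
 */
static size_t toy_collect(size_t start, size_t end, size_t out[TOY_RING_SIZE])
{
    size_t count = 0;
    size_t i;

    assert(start < TOY_RING_SIZE && end < TOY_RING_SIZE);

    if (start <= end) {
        /* Not wrapped: one contiguous run. */
        for (i = start; i <= end; i++)
            out[count++] = i;
        return count;
    }

    /* Wrapped: tail run first, then chain to the run at the head. */
    for (i = start; i < TOY_RING_SIZE; i++)
        out[count++] = i;
    for (i = 0; i <= end; i++)
        out[count++] = i;
    return count;
}
```

Stopping after the first loop in the wrapped case reproduces the bug, where only {n, n+1} is sent.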
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
    • bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf · d468e477
      John Fastabend committed
      It is possible to build a plaintext buffer using push helper that is larger
      than the allocated encrypt buffer. When this record is pushed to crypto
      layers this can result in a NULL pointer dereference because the crypto
      API expects the encrypt buffer is large enough to fit the plaintext
      buffer. Kernel splat below.
      
      To resolve this, catch the cases where this can happen and split the buffer
      into two records to send individually. Unfortunately, there is still one case to
      handle where the split creates a zero sized buffer. In this case we merge
      the buffers and unmark the split. This happens when apply is zero and user
      pushed data beyond encrypt buffer. This fixes the original case as well
      because the split allocated an encrypt buffer larger than the plaintext
      buffer and the merge simply moves the pointers around so we now have
      a reference to the new (larger) encrypt buffer.
      
      Perhaps it's not ideal, but it seems the best solution for a fixes branch
      and avoids handling these two cases, (a) apply that needs split and (b)
      non apply case. These are edge cases anyway, so optimizing them seems
      unnecessary unless someone wants to later in next branches.
      
      [  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [...]
      [  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
      [...]
      [  306.770350] Call Trace:
      [  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
      [  306.772026]  gcm_enc_copy_hash+0x4b/0x50
      [  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
      [  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
      [  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
      [  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
      [  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
      [  306.778239]  gcm_hash_init_continue+0x8f/0xa0
      [  306.779121]  gcm_hash+0x73/0x80
      [  306.779762]  gcm_encrypt_continue+0x6d/0x80
      [  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
      [  306.781474]  crypto_aead_encrypt+0x1f/0x30
      [  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
      [  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
      [  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
      [  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]
      
      test_sockmap test signature to trigger bug,
      
      [TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
    • bpf: Sockmap/tls, push write_space updates through ulp updates · 33bfe20d
      John Fastabend committed
      When a sockmap sock with TLS enabled is removed we clean up bpf/psock state
      and call tcp_update_ulp() to push updates to the TLS ULP on top. However, we
      don't push the write_space callback up and instead simply overwrite the
      op with the previous op stored in the psock. This may or may not be correct, so
      to ensure we don't overwrite the TLS write space hook, pass this field to
      the ULP and have it fix up the ctx.
      
      This completes a previous fix that pushed the ops through to the ULP
      but at the time missed doing this for write_space, presumably because
      write_space TLS hook was added around the same time.
      
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
  12. 11 Jan, 2020 2 commits
  13. 20 Dec, 2019 1 commit
  14. 07 Dec, 2019 1 commit
  15. 29 Nov, 2019 4 commits
  16. 20 Nov, 2019 1 commit
  17. 16 Nov, 2019 1 commit
  18. 07 Nov, 2019 2 commits
    • net/tls: add a TX lock · 79ffe608
      Jakub Kicinski committed
      TLS TX needs to release and re-acquire the socket lock if send buffer
      fills up.
      
      TLS SW TX path currently depends on only allowing one thread to enter
      the function by the abuse of sk_write_pending. If another writer is
      already waiting for memory no new ones are allowed in.
      
      This has two problems:
       - writers don't wake other threads up when they leave the kernel;
         meaning that this scheme works for single extra thread (second
         application thread or delayed work) because memory becoming
         available will send a wake up request, but as Mallesham and
         Pooja report with larger number of threads it leads to threads
         being put to sleep indefinitely;
       - the delayed work does not get _scheduled_ but it may _run_ when
         other writers are present leading to crashes as writers don't
         expect state to change under their feet (same records get pushed
         and freed multiple times); it's hard to reliably bail from the
         work, however, because the mere presence of a writer does not
         guarantee that the writer will push pending records before exiting.
      
      Ensuring wakeups always happen will make the code basically open
      code a mutex. Just use a mutex.
      
      The TLS HW TX path does not have any locking (not even the
      sk_write_pending hack), yet it uses a per-socket sg_tx_data
      array to push records.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Reported-by: Mallesham Jatharakonda <mallesh537@gmail.com>
      Reported-by: Pooja Trivedi <poojatrivedi@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tls: don't pay attention to sk_write_pending when pushing partial records · 02b1fa07
      Jakub Kicinski committed
      sk_write_pending being not zero does not guarantee that partial
      record will be pushed. If the thread waiting for memory times out
      the pending record may get stuck.
      
      In the case of tls_device there is no path where a partial record is
      set and a writer is present in the first place. Partial record is
      set only in tls_push_sg() and tls_push_sg() will return an
      error immediately. All tls_device callers of tls_push_sg()
      will return (and not wait for memory) if it failed.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 07 Oct, 2019 5 commits
  20. 06 Oct, 2019 5 commits