1. 28 Jul, 2021 (1 commit)
  2. 30 Jun, 2021 (1 commit)
  3. 21 Jun, 2021 (1 commit)
  4. 18 May, 2021 (1 commit)
  5. 12 Apr, 2021 (1 commit)
  6. 07 Apr, 2021 (1 commit)
    • bpf, sockmap: Fix sk->prot unhash op reset · 1c84b331
      Committed by John Fastabend
      In '4da6a196' we fixed a potential unhash loop caused when
      a TLS socket in a sockmap was removed from the sockmap. This
      happened because the unhash operation on the TLS ctx continued
      to point at the sockmap implementation of unhash even though the
      psock has already been removed. When a psock is removed, the
      sockmap unhash handler does the following:
      
       void sock_map_unhash(struct sock *sk)
       {
      	void (*saved_unhash)(struct sock *sk);
      	struct sk_psock *psock;
      
      	rcu_read_lock();
      	psock = sk_psock(sk);
      	if (unlikely(!psock)) {
      		rcu_read_unlock();
      		if (sk->sk_prot->unhash)
      			sk->sk_prot->unhash(sk);
      		return;
      	}
              [...]
       }
      
      The unlikely() case is there to handle the case where psock is detached
      but the proto ops have not been updated yet. But in the above case,
      with TLS and a removed psock, we never fixed sk_prot->unhash(), so unhash()
      points back to sock_map_unhash, resulting in a loop. To fix this we added
      this bit of code:
      
       static inline void sk_psock_restore_proto(struct sock *sk,
                                                struct sk_psock *psock)
       {
             sk->sk_prot->unhash = psock->saved_unhash;
      
      This will set the sk_prot->unhash back to its saved value. This is the
      correct callback for a TLS socket that has been removed from the sock_map.
      Unfortunately, this also overwrites the unhash pointer for all psocks.
      We effectively break sockmap unhash handling for any future sockets.
      Omitting the unhash operation will leave stale entries in the map if
      a socket transitions through unhash but does not do a close() op.
      
      To fix this, set unhash correctly before calling into tls_update. This
      way the TLS-enabled socket will point to the saved unhash() handler.
      
      Fixes: 4da6a196 ("bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop")
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: Lorenz Bauer <lmb@cloudflare.com>
      Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/161731441904.68884.15593917809745631972.stgit@john-XPS-13-9370
  7. 02 Apr, 2021 (6 commits)
  8. 27 Feb, 2021 (6 commits)
  9. 28 Jan, 2021 (1 commit)
  10. 12 Oct, 2020 (1 commit)
  11. 22 Aug, 2020 (1 commit)
  12. 01 Jul, 2020 (1 commit)
  13. 02 Jun, 2020 (1 commit)
    • bpf: Fix running sk_skb program types with ktls · e91de6af
      Committed by John Fastabend
      KTLS uses a stream parser to collect TLS messages and send them to
      the upper layer tls receive handler. This ensures the tls receiver
      has a full TLS header to parse when it is run. However, when a
      socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
      is enabled we end up with two stream parsers running on the same
      socket.
      
      The result is that both try to run on the same socket. First the
      KTLS stream parser runs and calls read_sock(), which ends up in
      tcp_read_sock(), which in turn calls tcp_rcv_skb(). This dequeues
      the skb from the sk_receive_queue. When this is done, the KTLS code
      calls the data_ready() callback which, because we stacked KTLS on
      top of the bpf stream verdict program, has been replaced with
      sk_psock_start_strp(). This will in turn kick the stream parser
      again and eventually do the same thing KTLS did above, calling into
      tcp_rcv_skb() and dequeuing a skb from the sk_receive_queue.
      
      At this point the data stream is broken. Part of the stream was
      handled by the KTLS side, and some other bytes may have been handled
      by the BPF side. Generally this results in either missing data
      or, more likely, a "Bad Message" complaint from the kTLS receive
      handler, as the BPF program steals some bytes meant to be in a
      TLS header and/or the TLS header length is no longer correct.
      
      We've already broken the idealized model where we can stack ULPs
      in any order with generic callbacks on the TX side to handle this.
      So in this patch we do the same thing, but for the RX side. We add
      a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
      program is running and add a tls_sw_has_ctx_rx() helper so BPF
      side can learn there is a TLS ULP on the socket.
      
      Then on BPF side we omit calling our stream parser to avoid
      breaking the data stream for the KTLS receiver. Then on the
      KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
      receiver is done with the packet but before it posts the
      msg to userspace. This gives us symmetry between the TX and
      RX halves and IMO makes it usable again. On the TX side we
      process packets in this order BPF -> TLS -> TCP and on
      the receive side in the reverse order TCP -> TLS -> BPF.
      
      Discovered while testing OpenSSL 3.0 Alpha2.0 release.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
  14. 06 May, 2020 (1 commit)
    • bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size · 81aabbb9
      Committed by John Fastabend
      In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size
      which is used to track total bytes in a message. But this is not
      correct because apply_bytes is itself modified in the main loop doing
      the mem_charge.
      
      At the end of this, sg.size is incorrectly set and out of
      sync with the actual sk values, and we can get a splat if we try to
      cork the data later and again try to redirect the msg to ingress. To
      fix this, instead of trying to track msg.size, do the easy thing and
      include it as part of the sk_msg_xfer logic so that when the msg is
      moved the sg.size is always correct.
      
      To reproduce the splat below, users will need ingress + cork and to
      hit an error path that will then try to 'free' the skmsg.
      
      [  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
      [  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317
      
      [  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
      [  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [  173.700009] Call Trace:
      [  173.700021]  dump_stack+0x8e/0xcb
      [  173.700029]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700034]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700042]  __kasan_report+0x102/0x15f
      [  173.700052]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700060]  kasan_report+0x32/0x50
      [  173.700070]  sk_msg_free_elem+0xdd/0x120
      [  173.700080]  __sk_msg_free+0x87/0x150
      [  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
      [  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
  15. 10 Mar, 2020 (3 commits)
  16. 22 Feb, 2020 (1 commit)
    • net, sk_msg: Annotate lockless access to sk_prot on clone · b8e202d1
      Committed by Jakub Sitnicki
      sk_msg and ULP frameworks override the protocol callbacks pointer in
      sk->sk_prot, while tcp accesses it locklessly when cloning the listening
      socket, that is, with neither sk_lock nor sk_callback_lock held.
      
      Once we enable use of listening sockets with sockmap (and hence sk_msg),
      there will be shared access to sk->sk_prot if socket is getting cloned
      while being inserted/deleted to/from the sockmap from another CPU:
      
      Read side:
      
      tcp_v4_rcv
        sk = __inet_lookup_skb(...)
        tcp_check_req(sk)
          inet_csk(sk)->icsk_af_ops->syn_recv_sock
            tcp_v4_syn_recv_sock
              tcp_create_openreq_child
                inet_csk_clone_lock
                  sk_clone_lock
                    READ_ONCE(sk->sk_prot)
      
      Write side:
      
      sock_map_ops->map_update_elem
        sock_map_update_elem
          sock_map_update_common
            sock_map_link_no_progs
              tcp_bpf_init
                tcp_bpf_update_sk_prot
                  sk_psock_update_proto
                    WRITE_ONCE(sk->sk_prot, ops)
      
      sock_map_ops->map_delete_elem
        sock_map_delete_elem
          __sock_map_delete
           sock_map_unref
             sk_psock_put
               sk_psock_drop
                 sk_psock_restore_proto
                   tcp_update_ulp
                     WRITE_ONCE(sk->sk_prot, proto)
      
      Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
  17. 19 Feb, 2020 (2 commits)
  18. 16 Jan, 2020 (2 commits)
  19. 29 Nov, 2019 (1 commit)
    • net: skmsg: fix TLS 1.3 crash with full sk_msg · 031097d9
      Committed by Jakub Kicinski
      TLS 1.3 started using the entry at the end of the SG array
      for chaining-in the single byte content type entry. This mostly
      works:
      
      [ E E E E E E . . ]
        ^           ^
         start       end
      
                       E < content type
                     /
      [ E E E E E E C . ]
        ^           ^
         start       end
      
      (Where E denotes a populated SG entry; C denotes a chaining entry.)
      
      If the array is full, however, the end will point to the start:
      
      [ E E E E E E E E ]
        ^
         start
         end
      
      And we end up overwriting the start:
      
          E < content type
         /
      [ C E E E E E E E ]
        ^
         start
         end
      
      The sg array is supposed to be a circular buffer with start and
      end markers pointing anywhere. In the case where start > end
      (i.e. the circular buffer has "wrapped") there is an extra entry
      reserved at the end to chain the two halves together.
      
      [ E E E E E E . . l ]
      
      (Where l is the reserved entry for "looping" back to front.)
      
      As suggested by John, let's reserve another entry for chaining
      SG entries after the main circular buffer. Note that this entry
      has to be pointed to by the end entry so its position is not fixed.
      
      Examples of full messages:
      
      [ E E E E E E E E . l ]
        ^               ^
         start           end
      
         <---------------.
      [ E E . E E E E E E l ]
            ^ ^
         end   start
      
      Now the end will always point to an unused entry, so TLS 1.3
      can always use it.
      
      Fixes: 130b392c ("net: tls: Add tls 1.3 support")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 06 Nov, 2019 (1 commit)
  21. 07 Oct, 2019 (1 commit)
  22. 22 Jul, 2019 (1 commit)
    • bpf: sockmap/tls, close can race with map free · 95fa1454
      Committed by John Fastabend
      When a map free is called and in parallel a socket is closed we
      have two paths that can potentially reset the socket prot ops, the
      bpf close() path and the map free path. This creates a problem
      with which prot ops should be used from the socket closed side.
      
      If the map_free side completes first then we want to call the
      original lowest level ops. However, if the tls path runs first
      we want to call the sockmap ops. Additionally, there was no locking
      around prot updates in the TLS code paths, so the prot ops could
      be changed multiple times, once from the TLS path and again from the
      sockmap side, potentially leaving ops pointed at either TLS or sockmap
      when the psock and/or tls context have already been destroyed.
      
      To fix this race, first, only update ops inside the callback lock
      so that TLS, sockmap and the lowest level all agree on prot state.
      Second, add a ULP callback update() so that lower layers can
      inform the upper layer when they are being removed, allowing the
      upper layer to reset its prot ops.
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order but will save that patch for *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  23. 23 May, 2019 (1 commit)
    • bpf: sockmap, restore sk_write_space when psock gets dropped · 186bcc3d
      Committed by Jakub Sitnicki
      Once psock gets unlinked from its sock (sk_psock_drop), user-space can
      still trigger a call to sk->sk_write_space by setting TCP_NOTSENT_LOWAT
      socket option. This causes a null-ptr-deref because we try to read
      psock->saved_write_space from sk_psock_write_space:
      
      ==================================================================
      BUG: KASAN: null-ptr-deref in sk_psock_write_space+0x69/0x80
      Read of size 8 at addr 00000000000001a0 by task sockmap-echo/131
      
      CPU: 0 PID: 131 Comm: sockmap-echo Not tainted 5.2.0-rc1-00094-gf49aa1de #81
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
      Call Trace:
       ? sk_psock_write_space+0x69/0x80
       __kasan_report.cold.2+0x5/0x3f
       ? sk_psock_write_space+0x69/0x80
       kasan_report+0xe/0x20
       sk_psock_write_space+0x69/0x80
       tcp_setsockopt+0x69a/0xfc0
       ? tcp_shutdown+0x70/0x70
       ? fsnotify+0x5b0/0x5f0
       ? remove_wait_queue+0x90/0x90
       ? __fget_light+0xa5/0xf0
       __sys_setsockopt+0xe6/0x180
       ? sockfd_lookup_light+0xb0/0xb0
       ? vfs_write+0x195/0x210
       ? ksys_write+0xc9/0x150
       ? __x64_sys_read+0x50/0x50
       ? __bpf_trace_x86_fpu+0x10/0x10
       __x64_sys_setsockopt+0x61/0x70
       do_syscall_64+0xc5/0x520
       ? vmacache_find+0xc0/0x110
       ? syscall_return_slowpath+0x110/0x110
       ? handle_mm_fault+0xb4/0x110
       ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
       ? trace_hardirqs_off_caller+0x4b/0x120
       ? trace_hardirqs_off_thunk+0x1a/0x3a
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f2e5e7cdcce
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 66 2e 0f 1f 84 00 00 00 00 00
      0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d 8a 11 0c 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffed011b778 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2e5e7cdcce
      RDX: 0000000000000019 RSI: 0000000000000006 RDI: 0000000000000007
      RBP: 00007ffed011b790 R08: 0000000000000004 R09: 00007f2e5e84ee80
      R10: 00007ffed011b788 R11: 0000000000000206 R12: 00007ffed011b78c
      R13: 00007ffed011b788 R14: 0000000000000007 R15: 0000000000000068
      ==================================================================
      
      Restore the saved sk_write_space callback when psock is being dropped to
      fix the crash.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  24. 21 Dec, 2018 (2 commits)
  25. 19 Dec, 2018 (1 commit)
    • bpf: sockmap, metadata support for reporting size of msg · 3bdbd022
      Committed by John Fastabend
      This adds metadata to sk_msg_md for BPF programs to read the sk_msg
      size.
      
      When the SK_MSG program is running under an application that is using
      sendfile, the data is not copied into sk_msg buffers by default. Rather,
      the BPF program uses sk_msg_pull_data to read the bytes in. This
      avoids doing the costly memcpy instructions when they are not in
      fact needed. However, if we don't know the size of the sk_msg, we
      have to guess whether the needed bytes are available by doing a pull
      request, which may fail. By including the size of the sk_msg, BPF
      programs can check the size before issuing sk_msg_pull_data requests.
      
      Additionally, the same applies for sendmsg calls when the application
      provides multiple iovs. Here the BPF program needs to pull in data
      to update data pointers, but it's not clear where the data ends without
      a size parameter. In many cases "guessing" is not easy to do,
      and it results in multiple calls to pull; without bounded loops
      everything gets fairly tricky.
      
      Clean this up by including a u32 size field. Note, all writes into
      sk_msg_md are rejected already from sk_msg_is_valid_access so nothing
      additional is needed there.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>