1. 21 6月, 2021 1 次提交
  2. 07 4月, 2021 1 次提交
    • J
      bpf, sockmap: Fix incorrect fwd_alloc accounting · 144748eb
      John Fastabend 提交于
      Incorrect accounting fwd_alloc can result in a warning when the socket
      is torn down,
      
       [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
       [...]
       [18455.319543] Call Trace:
       [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
       [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
       [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
       [18455.319631]  ? sk_reset_timer+0x15/0x70
       [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
       [18455.319663]  ? lock_release+0xb2/0x3f0
       [18455.319676]  ? __release_sock+0x8a/0x1b0
       [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319704]  ? lock_release+0x3f0/0x3f0
       [18455.319717]  ? __tcp_close+0x2c6/0x790
       [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
       [18455.319750]  tcp_v4_do_rcv+0x168/0x370
       [18455.319767]  __release_sock+0xbc/0x1b0
       [18455.319785]  __tcp_close+0x2ee/0x790
       [18455.319805]  tcp_close+0x20/0x80
      
      This currently happens because on redirect case we do skb_set_owner_r()
      with the original sock. This increments the fwd_alloc memory accounting
      on the original sock. Then on redirect we may push this into the queue
      of the psock we are redirecting to. When the skb is flushed from the
      queue we give the memory back to the original sock. The problem is if
      the original sock is destroyed/closed with skbs on another psocks queue
      then the original sock will not have a way to reclaim the memory before
      being destroyed. Then above warning will be thrown
      
        sockA                          sockB
      
        sk_psock_strp_read()
         sk_psock_verdict_apply()
           -- SK_REDIRECT --
           sk_psock_skb_redirect()
                                      skb_queue_tail(psock_other->ingress_skb..)
      
        sk_close()
         sock_map_unref()
           sk_psock_put()
             sk_psock_drop()
               sk_psock_zap_ingress()
      
      At this point we have torn down our own psock, but have the outstanding
      skb in psock_other. Note that SK_PASS doesn't have this problem because
      the sk_psock_drop() logic releases the skb, its still associated with
      our psock.
      
      To resolve lets only account for sockets on the ingress queue that are
      still associated with the current socket. On the redirect case we will
      check memory limits per 6fa9201a, but will omit fwd_alloc accounting
      until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
      or received with sk_psock_skb_ingress memory will be claimed on psock_other.
      
      Fixes: 6fa9201a ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
      Reported-by: NAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370
      144748eb
  3. 02 4月, 2021 8 次提交
  4. 27 2月, 2021 7 次提交
  5. 28 1月, 2021 1 次提交
  6. 18 11月, 2020 5 次提交
  7. 12 10月, 2020 7 次提交
  8. 24 8月, 2020 1 次提交
  9. 22 8月, 2020 1 次提交
  10. 28 6月, 2020 2 次提交
    • J
      bpf, sockmap: RCU dereferenced psock may be used outside RCU block · 8025751d
      John Fastabend 提交于
      If an ingress verdict program specifies message sizes greater than
      skb->len and there is an ENOMEM error due to memory pressure we
      may call the rcv_msg handler outside the strp_data_ready() caller
      context. This is because on an ENOMEM error the strparser will
      retry from a workqueue. The caller currently protects the use of
      psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
      block.
      
      But, in above workqueue error case the psock is accessed outside
      the read_lock/unlock block of the caller. So instead of using
      psock directly we must do a look up against the sk again to
      ensure the psock is available.
      
      There is an an ugly piece here where we must handle
      the case where we paused the strp and removed the psock. On
      psock removal we first pause the strparser and then remove
      the psock. If the strparser is paused while an skb is
      scheduled on the workqueue the skb will be dropped on the
      flow and kfree_skb() is called. If the workqueue manages
      to get called before we pause the strparser but runs the rcvmsg
      callback after the psock is removed we will hit the unlikely
      case where we run the sockmap rcvmsg handler but do not have
      a psock. For now we will follow strparser logic and drop the
      skb on the floor with skb_kfree(). This is ugly because the
      data is dropped. To date this has not caused problems in practice
      because either the application controlling the sockmap is
      coordinating with the datapath so that skbs are "flushed"
      before removal or we simply wait for the sock to be closed before
      removing it.
      
      This patch fixes the describe RCU bug and dropping the skb doesn't
      make things worse. Future patches will improve this by allowing
      the normal case where skbs are not merged to skip the strparser
      altogether. In practice many (most?) use cases have no need to
      merge skbs so its both a code complexity hit as seen above and
      a performance issue. For example, in the Cilium case we always
      set the strparser up to return sbks 1:1 without any merging and
      have avoided above issues.
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
      8025751d
    • J
      bpf, sockmap: RCU splat with redirect and strparser error or TLS · 93dd5f18
      John Fastabend 提交于
      There are two paths to generate the below RCU splat the first and
      most obvious is the result of the BPF verdict program issuing a
      redirect on a TLS socket (This is the splat shown below). Unlike
      the non-TLS case the caller of the *strp_read() hooks does not
      wrap the call in a rcu_read_lock/unlock. Then if the BPF program
      issues a redirect action we hit the RCU splat.
      
      However, in the non-TLS socket case the splat appears to be
      relatively rare, because the skmsg caller into the strp_data_ready()
      is wrapped in a rcu_read_lock/unlock. Shown here,
      
       static void sk_psock_strp_data_ready(struct sock *sk)
       {
      	struct sk_psock *psock;
      
      	rcu_read_lock();
      	psock = sk_psock(sk);
      	if (likely(psock)) {
      		if (tls_sw_has_ctx_rx(sk)) {
      			psock->parser.saved_data_ready(sk);
      		} else {
      			write_lock_bh(&sk->sk_callback_lock);
      			strp_data_ready(&psock->parser.strp);
      			write_unlock_bh(&sk->sk_callback_lock);
      		}
      	}
      	rcu_read_unlock();
       }
      
      If the above was the only way to run the verdict program we
      would be safe. But, there is a case where the strparser may throw an
      ENOMEM error while parsing the skb. This is a result of a failed
      skb_clone, or alloc_skb_for_msg while building a new merged skb when
      the msg length needed spans multiple skbs. This will in turn put the
      skb on the strp_wrk workqueue in the strparser code. The skb will
      later be dequeued and verdict programs run, but now from a
      different context without the rcu_read_lock()/unlock() critical
      section in sk_psock_strp_data_ready() shown above. In practice
      I have not seen this yet, because as far as I know most users of the
      verdict programs are also only working on single skbs. In this case no
      merge happens which could trigger the above ENOMEM errors. In addition
      the system would need to be under memory pressure. For example, we
      can't hit the above case in selftests because we missed having tests
      to merge skbs. (Added in later patch)
      
      To fix the below splat extend the rcu_read_lock/unnlock block to
      include the call to sk_psock_tls_verdict_apply(). This will fix both
      TLS redirect case and non-TLS redirect+error case. Also remove
      psock from the sk_psock_tls_verdict_apply() function signature its
      not used there.
      
      [ 1095.937597] WARNING: suspicious RCU usage
      [ 1095.940964] 5.7.0-rc7-02911-g463bac5f #1 Tainted: G        W
      [ 1095.944363] -----------------------------
      [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
      [ 1095.950866]
      [ 1095.950866] other info that might help us debug this:
      [ 1095.950866]
      [ 1095.957146]
      [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
      [ 1095.961482] 1 lock held by test_sockmap/15970:
      [ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
      [ 1095.968568]
      [ 1095.968568] stack backtrace:
      [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f #1
      [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [ 1095.980519] Call Trace:
      [ 1095.982191]  dump_stack+0x8f/0xd0
      [ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
      [ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
      [ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]
      
      v2: Improve commit message to identify non-TLS redirect plus error case
          condition as well as more common TLS case. In the process I decided
          doing the rcu_read_unlock followed by the lock/unlock inside branches
          was unnecessarily complex. We can just extend the current rcu block
          and get the same effeective without the shuffling and branching.
          Thanks Martin!
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Reported-by: NJakub Sitnicki <jakub@cloudflare.com>
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
      93dd5f18
  11. 02 6月, 2020 2 次提交
    • J
      bpf: Fix running sk_skb program types with ktls · e91de6af
      John Fastabend 提交于
      KTLS uses a stream parser to collect TLS messages and send them to
      the upper layer tls receive handler. This ensures the tls receiver
      has a full TLS header to parse when it is run. However, when a
      socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
      is enabled we end up with two stream parsers running on the same
      socket.
      
      The result is both try to run on the same socket. First the KTLS
      stream parser runs and calls read_sock() which will tcp_read_sock
      which in turn calls tcp_rcv_skb(). This dequeues the skb from the
      sk_receive_queue. When this is done KTLS code then data_ready()
      callback which because we stacked KTLS on top of the bpf stream
      verdict program has been replaced with sk_psock_start_strp(). This
      will in turn kick the stream parser again and eventually do the
      same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
      a skb from the sk_receive_queue.
      
      At this point the data stream is broke. Part of the stream was
      handled by the KTLS side some other bytes may have been handled
      by the BPF side. Generally this results in either missing data
      or more likely a "Bad Message" complaint from the kTLS receive
      handler as the BPF program steals some bytes meant to be in a
      TLS header and/or the TLS header length is no longer correct.
      
      We've already broke the idealized model where we can stack ULPs
      in any order with generic callbacks on the TX side to handle this.
      So in this patch we do the same thing but for RX side. We add
      a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
      program is running and add a tls_sw_has_ctx_rx() helper so BPF
      side can learn there is a TLS ULP on the socket.
      
      Then on BPF side we omit calling our stream parser to avoid
      breaking the data stream for the KTLS receiver. Then on the
      KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
      receiver is done with the packet but before it posts the
      msg to userspace. This gives us symmetry between the TX and
      RX halfs and IMO makes it usable again. On the TX side we
      process packets in this order BPF -> TLS -> TCP and on
      the receive side in the reverse order TCP -> TLS -> BPF.
      
      Discovered while testing OpenSSL 3.0 Alpha2.0 release.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-TowerSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
      e91de6af
    • J
      bpf: Refactor sockmap redirect code so its easy to reuse · ca2f5f21
      John Fastabend 提交于
      We will need this block of code called from tls context shortly
      lets refactor the redirect logic so its easy to use. This also
      cleans up the switch stmt so we have fewer fallthrough cases.
      
      No logic changes are intended.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-TowerSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
      ca2f5f21
  12. 25 2月, 2020 1 次提交
  13. 22 2月, 2020 1 次提交
    • J
      net, sk_msg: Clear sk_user_data pointer on clone if tagged · f1ff5ce2
      Jakub Sitnicki 提交于
      sk_user_data can hold a pointer to an object that is not intended to be
      shared between the parent socket and the child that gets a pointer copy on
      clone. This is the case when sk_user_data points at reference-counted
      object, like struct sk_psock.
      
      One way to resolve it is to tag the pointer with a no-copy flag by
      repurposing its lowest bit. Based on the bit-flag value we clear the child
      sk_user_data pointer after cloning the parent socket.
      
      The no-copy flag is stored in the pointer itself as opposed to externally,
      say in socket flags, to guarantee that the pointer and the flag are copied
      from parent to child socket in an atomic fashion. Parent socket state is
      subject to change while copying, we don't hold any locks at that time.
      
      This approach relies on an assumption that sk_user_data holds a pointer to
      an object aligned at least 2 bytes. A manual audit of existing users of
      rcu_dereference_sk_user_data helper confirms our assumption.
      
      Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
      char value or a pathological case of "struct { char c; }". To be safe, warn
      when the flag-bit is set when setting sk_user_data to catch any future
      misuses.
      
      It is worth considering why clearing sk_user_data unconditionally is not an
      option. There exist users, DRBD, NVMe, and Xen drivers being among them,
      that rely on the pointer being copied when cloning the listening socket.
      
      Potentially we could distinguish these users by checking if the listening
      socket has been created in kernel-space via sock_create_kern, and hence has
      sk_kern_sock flag set. However, this is not the case for NVMe and Xen
      drivers, which create sockets without marking them as belonging to the
      kernel.
      Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
      f1ff5ce2
  14. 23 1月, 2020 1 次提交
  15. 16 1月, 2020 1 次提交
    • J
      bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      John Fastabend 提交于
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      
      Currently, though we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
      believed it was safe. Further its not clear to me its even a good idea
      to try and do this on "live" sockets while networking side might also
      be using the socket. Instead of trying to reason about using the socks
      from both sides lets realize that every use case I'm aware of rarely
      deletes maps, in fact kubernetes/Cilium case builds map at init and
      never tears it down except on errors. So lets do the simple fix and
      grab sock lock.
      
      This patch wraps sock deletes from maps in sock lock and adds some
      annotations so we catch any other cases easier.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
      7e81a353