1. 28 1月, 2021 1 次提交
  2. 18 11月, 2020 5 次提交
  3. 12 10月, 2020 7 次提交
  4. 24 8月, 2020 1 次提交
  5. 22 8月, 2020 1 次提交
  6. 28 6月, 2020 2 次提交
    • J
      bpf, sockmap: RCU dereferenced psock may be used outside RCU block · 8025751d
      John Fastabend 提交于
      If an ingress verdict program specifies message sizes greater than
      skb->len and there is an ENOMEM error due to memory pressure we
      may call the rcv_msg handler outside the strp_data_ready() caller
      context. This is because on an ENOMEM error the strparser will
      retry from a workqueue. The caller currently protects the use of
      psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
      block.
      
      But, in above workqueue error case the psock is accessed outside
      the read_lock/unlock block of the caller. So instead of using
      psock directly we must do a look up against the sk again to
      ensure the psock is available.
      
      There is an an ugly piece here where we must handle
      the case where we paused the strp and removed the psock. On
      psock removal we first pause the strparser and then remove
      the psock. If the strparser is paused while an skb is
      scheduled on the workqueue the skb will be dropped on the
      flow and kfree_skb() is called. If the workqueue manages
      to get called before we pause the strparser but runs the rcvmsg
      callback after the psock is removed we will hit the unlikely
      case where we run the sockmap rcvmsg handler but do not have
      a psock. For now we will follow strparser logic and drop the
      skb on the floor with skb_kfree(). This is ugly because the
      data is dropped. To date this has not caused problems in practice
      because either the application controlling the sockmap is
      coordinating with the datapath so that skbs are "flushed"
      before removal or we simply wait for the sock to be closed before
      removing it.
      
      This patch fixes the describe RCU bug and dropping the skb doesn't
      make things worse. Future patches will improve this by allowing
      the normal case where skbs are not merged to skip the strparser
      altogether. In practice many (most?) use cases have no need to
      merge skbs so its both a code complexity hit as seen above and
      a performance issue. For example, in the Cilium case we always
      set the strparser up to return sbks 1:1 without any merging and
      have avoided above issues.
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
      8025751d
    • J
      bpf, sockmap: RCU splat with redirect and strparser error or TLS · 93dd5f18
      John Fastabend 提交于
      There are two paths to generate the below RCU splat the first and
      most obvious is the result of the BPF verdict program issuing a
      redirect on a TLS socket (This is the splat shown below). Unlike
      the non-TLS case the caller of the *strp_read() hooks does not
      wrap the call in a rcu_read_lock/unlock. Then if the BPF program
      issues a redirect action we hit the RCU splat.
      
      However, in the non-TLS socket case the splat appears to be
      relatively rare, because the skmsg caller into the strp_data_ready()
      is wrapped in a rcu_read_lock/unlock. Shown here,
      
       static void sk_psock_strp_data_ready(struct sock *sk)
       {
      	struct sk_psock *psock;
      
      	rcu_read_lock();
      	psock = sk_psock(sk);
      	if (likely(psock)) {
      		if (tls_sw_has_ctx_rx(sk)) {
      			psock->parser.saved_data_ready(sk);
      		} else {
      			write_lock_bh(&sk->sk_callback_lock);
      			strp_data_ready(&psock->parser.strp);
      			write_unlock_bh(&sk->sk_callback_lock);
      		}
      	}
      	rcu_read_unlock();
       }
      
      If the above was the only way to run the verdict program we
      would be safe. But, there is a case where the strparser may throw an
      ENOMEM error while parsing the skb. This is a result of a failed
      skb_clone, or alloc_skb_for_msg while building a new merged skb when
      the msg length needed spans multiple skbs. This will in turn put the
      skb on the strp_wrk workqueue in the strparser code. The skb will
      later be dequeued and verdict programs run, but now from a
      different context without the rcu_read_lock()/unlock() critical
      section in sk_psock_strp_data_ready() shown above. In practice
      I have not seen this yet, because as far as I know most users of the
      verdict programs are also only working on single skbs. In this case no
      merge happens which could trigger the above ENOMEM errors. In addition
      the system would need to be under memory pressure. For example, we
      can't hit the above case in selftests because we missed having tests
      to merge skbs. (Added in later patch)
      
      To fix the below splat extend the rcu_read_lock/unnlock block to
      include the call to sk_psock_tls_verdict_apply(). This will fix both
      TLS redirect case and non-TLS redirect+error case. Also remove
      psock from the sk_psock_tls_verdict_apply() function signature its
      not used there.
      
      [ 1095.937597] WARNING: suspicious RCU usage
      [ 1095.940964] 5.7.0-rc7-02911-g463bac5f #1 Tainted: G        W
      [ 1095.944363] -----------------------------
      [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
      [ 1095.950866]
      [ 1095.950866] other info that might help us debug this:
      [ 1095.950866]
      [ 1095.957146]
      [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
      [ 1095.961482] 1 lock held by test_sockmap/15970:
      [ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
      [ 1095.968568]
      [ 1095.968568] stack backtrace:
      [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f #1
      [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [ 1095.980519] Call Trace:
      [ 1095.982191]  dump_stack+0x8f/0xd0
      [ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
      [ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
      [ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]
      
      v2: Improve commit message to identify non-TLS redirect plus error case
          condition as well as more common TLS case. In the process I decided
          doing the rcu_read_unlock followed by the lock/unlock inside branches
          was unnecessarily complex. We can just extend the current rcu block
          and get the same effeective without the shuffling and branching.
          Thanks Martin!
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Reported-by: NJakub Sitnicki <jakub@cloudflare.com>
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
      93dd5f18
  7. 02 6月, 2020 2 次提交
    • J
      bpf: Fix running sk_skb program types with ktls · e91de6af
      John Fastabend 提交于
      KTLS uses a stream parser to collect TLS messages and send them to
      the upper layer tls receive handler. This ensures the tls receiver
      has a full TLS header to parse when it is run. However, when a
      socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
      is enabled we end up with two stream parsers running on the same
      socket.
      
      The result is both try to run on the same socket. First the KTLS
      stream parser runs and calls read_sock() which will tcp_read_sock
      which in turn calls tcp_rcv_skb(). This dequeues the skb from the
      sk_receive_queue. When this is done KTLS code then data_ready()
      callback which because we stacked KTLS on top of the bpf stream
      verdict program has been replaced with sk_psock_start_strp(). This
      will in turn kick the stream parser again and eventually do the
      same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
      a skb from the sk_receive_queue.
      
      At this point the data stream is broke. Part of the stream was
      handled by the KTLS side some other bytes may have been handled
      by the BPF side. Generally this results in either missing data
      or more likely a "Bad Message" complaint from the kTLS receive
      handler as the BPF program steals some bytes meant to be in a
      TLS header and/or the TLS header length is no longer correct.
      
      We've already broke the idealized model where we can stack ULPs
      in any order with generic callbacks on the TX side to handle this.
      So in this patch we do the same thing but for RX side. We add
      a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
      program is running and add a tls_sw_has_ctx_rx() helper so BPF
      side can learn there is a TLS ULP on the socket.
      
      Then on BPF side we omit calling our stream parser to avoid
      breaking the data stream for the KTLS receiver. Then on the
      KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
      receiver is done with the packet but before it posts the
      msg to userspace. This gives us symmetry between the TX and
      RX halfs and IMO makes it usable again. On the TX side we
      process packets in this order BPF -> TLS -> TCP and on
      the receive side in the reverse order TCP -> TLS -> BPF.
      
      Discovered while testing OpenSSL 3.0 Alpha2.0 release.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-TowerSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
      e91de6af
    • J
      bpf: Refactor sockmap redirect code so its easy to reuse · ca2f5f21
      John Fastabend 提交于
      We will need this block of code called from tls context shortly
      lets refactor the redirect logic so its easy to use. This also
      cleans up the switch stmt so we have fewer fallthrough cases.
      
      No logic changes are intended.
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-TowerSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
      ca2f5f21
  8. 25 2月, 2020 1 次提交
  9. 22 2月, 2020 1 次提交
    • J
      net, sk_msg: Clear sk_user_data pointer on clone if tagged · f1ff5ce2
      Jakub Sitnicki 提交于
      sk_user_data can hold a pointer to an object that is not intended to be
      shared between the parent socket and the child that gets a pointer copy on
      clone. This is the case when sk_user_data points at reference-counted
      object, like struct sk_psock.
      
      One way to resolve it is to tag the pointer with a no-copy flag by
      repurposing its lowest bit. Based on the bit-flag value we clear the child
      sk_user_data pointer after cloning the parent socket.
      
      The no-copy flag is stored in the pointer itself as opposed to externally,
      say in socket flags, to guarantee that the pointer and the flag are copied
      from parent to child socket in an atomic fashion. Parent socket state is
      subject to change while copying, we don't hold any locks at that time.
      
      This approach relies on an assumption that sk_user_data holds a pointer to
      an object aligned at least 2 bytes. A manual audit of existing users of
      rcu_dereference_sk_user_data helper confirms our assumption.
      
      Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
      char value or a pathological case of "struct { char c; }". To be safe, warn
      when the flag-bit is set when setting sk_user_data to catch any future
      misuses.
      
      It is worth considering why clearing sk_user_data unconditionally is not an
      option. There exist users, DRBD, NVMe, and Xen drivers being among them,
      that rely on the pointer being copied when cloning the listening socket.
      
      Potentially we could distinguish these users by checking if the listening
      socket has been created in kernel-space via sock_create_kern, and hence has
      sk_kern_sock flag set. However, this is not the case for NVMe and Xen
      drivers, which create sockets without marking them as belonging to the
      kernel.
      Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
      f1ff5ce2
  10. 23 1月, 2020 1 次提交
  11. 16 1月, 2020 1 次提交
    • J
      bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      John Fastabend 提交于
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      
      Currently, though we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
      believed it was safe. Further its not clear to me its even a good idea
      to try and do this on "live" sockets while networking side might also
      be using the socket. Instead of trying to reason about using the socks
      from both sides lets realize that every use case I'm aware of rarely
      deletes maps, in fact kubernetes/Cilium case builds map at init and
      never tears it down except on errors. So lets do the simple fix and
      grab sock lock.
      
      This patch wraps sock deletes from maps in sock lock and adds some
      annotations so we catch any other cases easier.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
      7e81a353
  12. 29 11月, 2019 1 次提交
    • J
      net: skmsg: fix TLS 1.3 crash with full sk_msg · 031097d9
      Jakub Kicinski 提交于
      TLS 1.3 started using the entry at the end of the SG array
      for chaining-in the single byte content type entry. This mostly
      works:
      
      [ E E E E E E . . ]
        ^           ^
         start       end
      
                       E < content type
                     /
      [ E E E E E E C . ]
        ^           ^
         start       end
      
      (Where E denotes a populated SG entry; C denotes a chaining entry.)
      
      If the array is full, however, the end will point to the start:
      
      [ E E E E E E E E ]
        ^
         start
         end
      
      And we end up overwriting the start:
      
          E < content type
         /
      [ C E E E E E E E ]
        ^
         start
         end
      
      The sg array is supposed to be a circular buffer with start and
      end markers pointing anywhere. In case where start > end
      (i.e. the circular buffer has "wrapped") there is an extra entry
      reserved at the end to chain the two halves together.
      
      [ E E E E E E . . l ]
      
      (Where l is the reserved entry for "looping" back to front.
      
      As suggested by John, let's reserve another entry for chaining
      SG entries after the main circular buffer. Note that this entry
      has to be pointed to by the end entry so its position is not fixed.
      
      Examples of full messages:
      
      [ E E E E E E E E . l ]
        ^               ^
         start           end
      
         <---------------.
      [ E E . E E E E E E l ]
            ^ ^
         end   start
      
      Now the end will always point to an unused entry, so TLS 1.3
      can always use it.
      
      Fixes: 130b392c ("net: tls: Add tls 1.3 support")
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      031097d9
  13. 22 11月, 2019 1 次提交
  14. 06 11月, 2019 1 次提交
  15. 25 8月, 2019 1 次提交
  16. 22 7月, 2019 1 次提交
    • J
      bpf: sockmap/tls, close can race with map free · 95fa1454
      John Fastabend 提交于
      When a map free is called and in parallel a socket is closed we
      have two paths that can potentially reset the socket prot ops, the
      bpf close() path and the map free path. This creates a problem
      with which prot ops should be used from the socket closed side.
      
      If the map_free side completes first then we want to call the
      original lowest level ops. However, if the tls path runs first
      we want to call the sockmap ops. Additionally there was no locking
      around prot updates in TLS code paths so the prot ops could
      be changed multiple times once from TLS path and again from sockmap
      side potentially leaving ops pointed at either TLS or sockmap
      when psock and/or tls context have already been destroyed.
      
      To fix this race first only update ops inside callback lock
      so that TLS, sockmap and lowest level all agree on prot state.
      Second and a ULP callback update() so that lower layers can
      inform the upper layer when they are being removed allowing the
      upper layer to reset prot ops.
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order but will save that patch for *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      95fa1454
  17. 14 5月, 2019 2 次提交
    • J
      bpf: sockmap fix msg->sg.size account on ingress skb · cabede8b
      John Fastabend 提交于
      When converting a skb to msg->sg we forget to set the size after the
      latest ktls/tls code conversion. This patch can be reached by doing
      a redir into ingress path from BPF skb sock recv hook. Then trying to
      read the size fails.
      
      Fix this by setting the size.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cabede8b
    • J
      bpf: sockmap, only stop/flush strp if it was enabled at some point · 01489436
      John Fastabend 提交于
      If we try to call strp_done on a parser that has never been
      initialized, because the sockmap user is only using TX side for
      example we get the following error.
      
        [  883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0
        ...
        [  883.422095] Workqueue: events sk_psock_destroy_deferred
        [  883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0
      
      This had been wrapped in a 'if (psock->parser.enabled)' logic which
      was broken because the strp_done() was never actually being called
      because we do a strp_stop() earlier in the tear down logic will
      set parser.enabled to false. This could result in a use after free
      if work was still in the queue and was resolved by the patch here,
      1d79895a ("sk_msg: Always cancel strp work before freeing the
      psock"). However, calling strp_stop(), done by the patch marked in
      the fixes tag, only is useful if we never initialized a strp parser
      program and never initialized the strp to start with. Because if
      we had initialized a stream parser strp_stop() would have been called
      by sk_psock_drop() earlier in the tear down process.  By forcing the
      strp to stop we get past the WARNING in strp_done that checks
      the stopped flag but calling cancel_work_sync on work that has never
      been initialized is also wrong and generates the warning above.
      
      To fix check if the parser program exists. If the program exists
      then the strp work has been initialized and must be sync'd and
      cancelled before free'ing any structures. If no program exists we
      never initialized the stream parser in the first place so skip the
      sync/cancel logic implemented by strp_done.
      
      Finally, remove the strp_done its not needed and in the case where we
      are using the stream parser has already been called.
      
      Fixes: e8e34377 ("bpf: Stop the psock parser before canceling its work")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      01489436
  18. 07 3月, 2019 1 次提交
    • J
      bpf: Stop the psock parser before canceling its work · e8e34377
      Jakub Sitnicki 提交于
      We might have never enabled (started) the psock's parser, in which case it
      will not get stopped when destroying the psock. This leads to a warning
      when trying to cancel parser's work from psock's deferred destructor:
      
      [  405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40
      [  405.326712] Modules linked in: [last unloaded: test_bpf]
      [  405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G        W         5.0.0 #42
      [  405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
      [  405.329712] Workqueue: events sk_psock_destroy_deferred
      [  405.330254] RIP: 0010:strp_done+0x3c/0x40
      [  405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 <0f> 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b
      [  405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246
      [  405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8
      [  405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640
      [  405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665
      [  405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600
      [  405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8
      [  405.338777] FS:  0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
      [  405.339903] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0
      [  405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  405.344415] Call Trace:
      [  405.344821]  sk_psock_destroy_deferred+0x23/0x1b0
      [  405.345585]  process_one_work+0x1ae/0x3e0
      [  405.346110]  worker_thread+0x3c/0x3b0
      [  405.346576]  ? pwq_unbound_release_workfn+0xd0/0xd0
      [  405.347187]  kthread+0x11d/0x140
      [  405.347601]  ? __kthread_parkme+0x80/0x80
      [  405.348108]  ret_from_fork+0x35/0x40
      [  405.348566] ---[ end trace a4a3af4026a327d4 ]---
      
      Stop psock's parser just before canceling its work.
      
      Fixes: 1d79895a ("sk_msg: Always cancel strp work before freeing the psock")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      e8e34377
  19. 29 1月, 2019 1 次提交
    • J
      sk_msg: Always cancel strp work before freeing the psock · 1d79895a
      Jakub Sitnicki 提交于
      Despite having stopped the parser, we still need to deinitialize it
      by calling strp_done so that it cancels its work. Otherwise the worker
      thread can run after we have freed the parser, and attempt to access
      its workqueue resulting in a use-after-free:
      
      ==================================================================
      BUG: KASAN: use-after-free in pwq_activate_delayed_work+0x1b/0x1d0
      Read of size 8 at addr ffff888069975240 by task kworker/u2:2/93
      
      CPU: 0 PID: 93 Comm: kworker/u2:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3-dirty #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      Workqueue:            (null) (kstrp)
      Call Trace:
       print_address_description+0x6e/0x2b0
       ? pwq_activate_delayed_work+0x1b/0x1d0
       kasan_report+0xfd/0x177
       ? pwq_activate_delayed_work+0x1b/0x1d0
       ? pwq_activate_delayed_work+0x1b/0x1d0
       pwq_activate_delayed_work+0x1b/0x1d0
       ? process_one_work+0x4aa/0x660
       pwq_dec_nr_in_flight+0x9b/0x100
       worker_thread+0x82/0x680
       ? process_one_work+0x660/0x660
       kthread+0x1b9/0x1e0
       ? __kthread_create_on_node+0x250/0x250
       ret_from_fork+0x1f/0x30
      
      Allocated by task 111:
       sk_psock_init+0x3c/0x1b0
       sock_map_link.isra.2+0x103/0x4b0
       sock_map_update_common+0x94/0x270
       sock_map_update_elem+0x145/0x160
       __se_sys_bpf+0x152e/0x1e10
       do_syscall_64+0xb2/0x3e0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Freed by task 112:
       kfree+0x7f/0x140
       process_one_work+0x40b/0x660
       worker_thread+0x82/0x680
       kthread+0x1b9/0x1e0
       ret_from_fork+0x1f/0x30
      
      The buggy address belongs to the object at ffff888069975180
       which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 192 bytes inside of
       512-byte region [ffff888069975180, ffff888069975380)
      The buggy address belongs to the page:
      page:ffffea0001a65d00 count:1 mapcount:0 mapping:ffff88806d401280 index:0x0 compound_mapcount: 0
      flags: 0x4000000000010200(slab|head)
      raw: 4000000000010200 dead000000000100 dead000000000200 ffff88806d401280
      raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888069975100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888069975180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff888069975200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff888069975280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888069975300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      Reported-by: NMarek Majkowski <marek@cloudflare.com>
      Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.comAcked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1d79895a
  20. 18 1月, 2019 1 次提交
    • V
      Optimize sk_msg_clone() by data merge to end dst sg entry · fda497e5
      Vakul Garg 提交于
      Function sk_msg_clone has been modified to merge the data from source sg
      entry to destination sg entry if the cloned data resides in same page
      and is contiguous to the end entry of destination sk_msg. This improves
      kernel tls throughput to the tune of 10%.
      
      When the user space tls application calls sendmsg() with MSG_MORE, it leads
      to calling sk_msg_clone() with new data being cloned placed continuous to
      previously cloned data. Without this optimization, a new SG entry in
      the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
      gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
      even before a full 16K of allowable record data is accumulated. Hence we
      lose oppurtunity to encrypt and send a full 16K record.
      
      With this patch, the kernel tls can accumulate full 16K of record data
      irrespective of the size of data passed in sendmsg() with MSG_MORE.
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fda497e5
  21. 22 12月, 2018 1 次提交
    • V
      Prevent overflow of sk_msg in sk_msg_clone() · 5c1e7e94
      Vakul Garg 提交于
      Fixed function sk_msg_clone() to prevent overflow of 'dst' while adding
      pages in scatterlist entries. The overflow of 'dst' causes crash in kernel
      tls module while doing record encryption.
      
      Crash fixed by this patch.
      
      [   78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
      [   78.804900] Mem abort info:
      [   78.807683]   ESR = 0x96000004
      [   78.810744]   Exception class = DABT (current EL), IL = 32 bits
      [   78.816677]   SET = 0, FnV = 0
      [   78.819727]   EA = 0, S1PTW = 0
      [   78.822873] Data abort info:
      [   78.825759]   ISV = 0, ISS = 0x00000004
      [   78.829600]   CM = 0, WnR = 0
      [   78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
      [   78.839195] [0000000000000008] pgd=0000000000000000
      [   78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      [   78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
      [   78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da6-dirty #107
      [   78.874149] Hardware name: LS1043A RDB Board (DT)
      [   78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
      [   78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
      [   78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
      [   78.893366] sp : ffff00001d04b600
      [   78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
      [   78.901970] x27: 0000000000000000 x26: ffff80006c8de786
      [   78.907272] x25: ffff00001d04b760 x24: 000000000000001a
      [   78.912573] x23: 0000000000000006 x22: ffff80006814e440
      [   78.917874] x21: 0000000000000100 x20: 0000000000000000
      [   78.923175] x19: 000081ffffffffff x18: 0000000000000400
      [   78.928476] x17: 0000000000000008 x16: 0000000000000000
      [   78.933778] x15: 0000000000000100 x14: 0000000000000001
      [   78.939079] x13: 0000000000001080 x12: 0000000000000020
      [   78.944381] x11: 0000000000001080 x10: 00000000ffff0002
      [   78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
      [   78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
      [   78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
      [   78.965588] x3 : 0000000000000000 x2 : 0000000000001086
      [   78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
      [   78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
      [   78.982968] Call trace:
      [   78.985406]  scatterwalk_copychunks+0x164/0x1c8
      [   78.989927]  skcipher_walk_next+0x28c/0x448
      [   78.994099]  skcipher_walk_done+0xfc/0x258
      [   78.998187]  gcm_encrypt+0x434/0x4c0
      [   79.001758]  tls_push_record+0x354/0xa58 [tls]
      [   79.006194]  bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
      [   79.010978]  tls_sw_sendmsg+0x650/0x780 [tls]
      [   79.015326]  inet_sendmsg+0x2c/0xf8
      [   79.018806]  sock_sendmsg+0x18/0x30
      [   79.022284]  __sys_sendto+0x104/0x138
      [   79.025935]  __arm64_sys_sendto+0x24/0x30
      [   79.029936]  el0_svc_common+0x60/0xe8
      [   79.033588]  el0_svc_handler+0x2c/0x80
      [   79.037327]  el0_svc+0x8/0xc
      [   79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
      [   79.046283] ---[ end trace 74db007d069c1cf7 ]---
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c1e7e94
  22. 21 12月, 2018 3 次提交
    • J
      bpf: sk_msg, zap ingress queue on psock down · a136678c
      John Fastabend 提交于
      In addition to releasing any cork'ed data on a psock when the psock
      is removed we should also release any skb's in the ingress work queue.
      Otherwise the skb's eventually get free'd but late in the tear
      down process so we see the WARNING due to non-zero sk_forward_alloc.
      
        void sk_stream_kill_queues(struct sock *sk)
        {
      	...
      	WARN_ON(sk->sk_forward_alloc);
      	...
        }
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a136678c
    • J
      bpf: sk_msg, fix socket data_ready events · 552de910
      John Fastabend 提交于
      When a skb verdict program is in-use and either another BPF program
      redirects to that socket or the new SK_PASS support is used the
      data_ready callback does not wake up application. Instead because
      the stream parser/verdict is using the sk data_ready callback we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and if so call the saved pointer which is the
      upper layers wake up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      552de910
    • J
      bpf: skb_verdict, support SK_PASS on RX BPF path · 51199405
      John Fastabend 提交于
      Add SK_PASS verdict support to SK_SKB_VERDICT programs. Now that
      support for redirects exists we can implement SK_PASS as a redirect
      to the same socket. This simplifies the BPF programs and avoids an
      extra map lookup on RX path for simple visibility cases.
      
      Further, reduces user (BPF programmer in this context) confusion
      when their program drops skb due to lack of support.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      51199405
  23. 02 12月, 2018 1 次提交
    • P
      net/core/skmsg: Replace call_rcu_sched() with call_rcu() · 0245b80e
      Paul E. McKenney 提交于
      Now that call_rcu()'s callback is not invoked until after all
      preempt-disable regions of code have completed (in addition to explicitly
      marked RCU read-side critical sections), call_rcu() can be used in place
      of call_rcu_sched().  This commit therefore makes that change.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: <netdev@vger.kernel.org>
      0245b80e
  24. 16 10月, 2018 2 次提交
    • D
      tls: convert to generic sk_msg interface · d829e9c4
      Daniel Borkmann 提交于
      Convert kTLS over to make use of sk_msg interface for plaintext and
      encrypted scattergather data, so it reuses all the sk_msg helpers
      and data structure which later on in a second step enables to glue
      this to BPF.
      
      This also allows to remove quite a bit of open coded helpers which
      are covered by the sk_msg API. Recent changes in kTLs 80ece6a0
      ("tls: Remove redundant vars from tls record structure") and
      4e6d4720 ("tls: Add support for inplace records encryption")
      changed the data path handling a bit; while we've kept the latter
      optimization intact, we had to undo the former change to better
      fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
      been brought back and are linked into the sk_msg sgs. Now the kTLS
      record contains a msg_plaintext and msg_encrypted sk_msg each.
      
      In the original code, the zerocopy_from_iter() has been used out
      of TX but also RX path. For the strparser skb-based RX path,
      we've left the zerocopy_from_iter() in decrypt_internal() mostly
      untouched, meaning it has been moved into tls_setup_from_iter()
      with charging logic removed (as not used from RX). Given RX path
      is not based on sk_msg objects, we haven't pursued setting up a
      dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
      could be an option to prusue in a later step.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d829e9c4
    • D
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann 提交于
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      Latter becomes in particular useful for kTLS where data encryption
      is deferred into the kernel, and as such enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is being encrypted after BPF has run and came to
      a verdict. In order to get there, first step is to transform open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same with the old sock map code and therefore no change
      of user facing behavior or APIs. While pursuing this work it
      also helped finding a number of bugs in the old sockmap code
      that we've fixed already in earlier commits. The test_sockmap
      kselftest suite passes through fine as well.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      604326b4