1. 28 Aug 2018, 3 commits
    • bpf: sockmap, decrement copied count correctly in redirect error case · 501ca817
      Committed by John Fastabend
      Currently, when a redirect occurs in sockmap and an error occurs in
      the redirect call we unwind the scatterlist once in the error path
      of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
      in the error path of sendmsg we decrement the copied count by the
      send size.
      
      However, it's possible we partially sent data before the error was
      generated. This can happen if do_tcp_sendpages() partially sends the
      scatterlist before encountering a memory pressure error. If this
      happens we need to decrement the copied value (the value tracking
      how many bytes were actually sent to the TCP stack) by the number of
      remaining bytes, _not_ the entire send size. Otherwise we risk
      confusing userspace.
      
      Also, we don't need two calls to free the scatterlist; one is
      good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
      then properly reduce copied by the number of remaining bytes, which
      may in fact be the entire send size if no bytes were sent.
      
      To do this, use a bool to indicate whether free_start_sg() should do
      mem accounting or not.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
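      As a rough illustration of the accounting rule (a standalone C toy,
      not the kernel code), suppose sendmsg attempted 8192 bytes and
      do_tcp_sendpages() accepted 4096 before failing:

        #include <assert.h>

        int main(void)
        {
                int copied = 8192;      /* bytes counted toward the user's sendmsg */
                int send_size = 8192;   /* size of the failed redirect */
                int sent = 4096;        /* partial progress before the error */

                /* Correct: subtract only the bytes that never reached the
                 * TCP stack. Subtracting send_size would report 0 to the
                 * user even though 4096 bytes were actually sent. */
                copied -= (send_size - sent);
                assert(copied == 4096);
                return 0;
        }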
    • bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg · 15c480ef
      Committed by Daniel Borkmann
      In bpf_tcp_recvmsg() we first took a reference on the psock; however,
      once we find that there are skbs in the normal socket's receive queue
      we return and process them through tcp_recvmsg(). The problem is that
      we leak the taken reference on the psock in that path. Given we don't
      really do anything with the psock at this point, move the skb_queue_empty()
      test before we fetch the psock to fix this case.
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
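      A minimal sketch of the fixed ordering (simplified, with helper names
      assumed from the description above; not the exact kernel source):

        static int bpf_tcp_recvmsg_sketch(struct sock *sk, struct msghdr *msg,
                                          size_t len, int nonblock, int flags,
                                          int *addr_len)
        {
                struct smap_psock *psock;

                /* check the regular queue before fetching the psock: this
                 * early-return path now holds no reference, so none leaks */
                if (!skb_queue_empty(&sk->sk_receive_queue))
                        return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);

                rcu_read_lock();
                psock = smap_psock_sk(sk);
                if (unlikely(!psock) || !refcount_inc_not_zero(&psock->refcnt)) {
                        rcu_read_unlock();
                        return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
                }
                rcu_read_unlock();
                /* ... proceed with the psock ingress queue, then drop the ref ... */
                return 0;
        }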
    • bpf, sockmap: fix potential use after free in bpf_tcp_close · e06fa9c1
      Committed by Daniel Borkmann
      In bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
      A parallel update on the sock hash map can happen between psock_map_pop()
      and lookup_elem_raw() where we override the element under link->hash /
      link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
      test whether an element is present, but we do not test whether the
      element is in fact the element we were looking for.
      
      We lock the sock in bpf_tcp_close() during that time, and
      sock_hash_update_elem() holds a sock lock as well. However, the latter
      locks the sock which is newly updated, not the one we're purging from
      the hash table. This means that while one CPU is doing the lookup from
      bpf_tcp_close(), another CPU can do the map update in parallel, drop
      our sock from the hlist, and release the psock.
      
      Subsequently the first CPU will find the new sock and attempt to drop
      and release the old sock yet another time. The fix is that we need to
      check the elements for a match after lookup, similar to what we do in
      the sock map. Note that the hash tab elems are freed via RCU, so access
      to their link->hash / link->key is fine since we're under the RCU read
      side there.
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
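      A sketch of the post-lookup identity check (field names assumed for
      illustration): finding *an* element under link->hash / link->key is
      not enough; it must be the very element we queued for removal,
      otherwise a parallel update has already replaced it:

        l = lookup_elem_raw(head, link->hash, link->key, key_size);
        if (l && l == link->hash_link) {   /* match the exact element */
                hlist_del_rcu(&l->hash_node);
                /* ... drop the ref and destroy the psock ... */
        }
        /* else: a parallel update already replaced it; nothing to do */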
  2. 23 Aug 2018, 3 commits
  3. 17 Aug 2018, 3 commits
    • bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist · 585f5a62
      Committed by Daniel Borkmann
      The current code in sock_map_ctx_update_elem() allows for BPF_EXIST
      and BPF_NOEXIST map update flags. While on array-like maps this approach
      is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others
      enforce map update flags to be BPF_ANY such that xchg() can be used
      directly, the current implementation in sock map does not guarantee
      that such operation with BPF_EXIST / BPF_NOEXIST is atomic.
      
      The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
      socket from the slot which is then tested for NULL / non-NULL. However
      later after __sock_map_ctx_update_elem(), the actual update is done
      through osock = xchg(&stab->sock_map[i], sock). The problem is that in
      the meantime a different CPU could have updated / deleted a socket
      on that specific slot, and thus the flag constraints won't hold anymore.
      
      I've been thinking whether best would be to just break UAPI and do
      an enforcement of BPF_ANY to check if someone actually complains,
      however trouble is that already in BPF kselftest we use BPF_NOEXIST
      for the map update, and therefore it might have been copied into
      applications already. The fix to keep the current behavior intact
      would be to add a map lock similar to the sock hash bucket lock only
      for covering the whole map.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
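      A sketch of the fix described above (simplified; stab->lock stands in
      for the new map-wide lock): the flag test and the xchg() must observe
      the same slot state, so both run under the one lock:

        raw_spin_lock_bh(&stab->lock);           /* assumed map-wide lock */
        osock = stab->sock_map[i];
        if ((osock && flags == BPF_NOEXIST) || (!osock && flags == BPF_EXIST)) {
                err = osock ? -EEXIST : -ENOENT;
                goto out_unlock;                 /* releases stab->lock */
        }
        /* ... __sock_map_ctx_update_elem() ... */
        osock = xchg(&stab->sock_map[i], sock);  /* same slot state as tested */
        raw_spin_unlock_bh(&stab->lock);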
    • bpf, sockmap: fix map elem deletion race with smap_stop_sock · 166ab6f0
      Committed by Daniel Borkmann
      smap_start_sock() and smap_stop_sock() are each protected under
      the sock->sk_callback_lock from their call-sites, except in the case
      of sock_map_delete_elem() where we drop the old socket from the map
      slot. This is racy because the same sock could be part of multiple
      sock maps, so we can run smap_stop_sock() in parallel, and given that
      at that point psock->strp_enabled might be true on both CPUs, we might,
      for example, wrongly restore the sk->sk_data_ready / sk->sk_write_space
      callbacks. Therefore, hold the sock->sk_callback_lock on delete as well. Looks
      like 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add
      multi-map support") had this right, but later on e9db4ef6 ("bpf:
      sockhash fix omitted bucket lock in sock_close") removed it again
      from delete leaving this smap_stop_sock() instance unprotected.
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
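      A minimal sketch of the fixed delete path (simplified): wrap the
      smap_stop_sock() call in sock_map_delete_elem() with the same lock its
      other call-sites hold, so two maps sharing a sock cannot run it
      concurrently:

        write_lock_bh(&sock->sk_callback_lock);
        smap_stop_sock(psock, sock);    /* now serialized with other maps */
        write_unlock_bh(&sock->sk_callback_lock);
        smap_release_sock(psock, sock);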
    • bpf, sockmap: fix leakage of smap_psock_map_entry · d40b0116
      Committed by Daniel Borkmann
      While working on sockmap I noticed that we do not always kfree the
      struct smap_psock_map_entry list elements which track psocks attached
      to maps. In the case of sock_hash_ctx_update_elem(), these map entries
      are allocated outside of __sock_map_ctx_update_elem() with their
      linkage to the socket hash table filled. In the case of sock array,
      the map entries are allocated inside of __sock_map_ctx_update_elem()
      and added with their linkage to the psock->maps. Both additions are
      under psock->maps_lock each.
      
      Now, we drop these elements from their psock->maps list on a few
      occasions: i) in sock array via smap_list_map_remove() when an entry
      is either deleted from the map from user space, or updated via
      user space or a BPF program where we drop the old socket at that map
      slot, or the sock array is freed via sock_map_free() and drops all
      its elements; ii) for sock hash via smap_list_hash_remove() on exactly
      the same occasions as just described for sock array; iii) in
      bpf_tcp_close() where we remove the elements from the list via
      psock_map_pop() and iterate over them, dropping them from either
      sock array or sock hash; and last but not least iv) once again in
      smap_gc_work(), which is a callback for deferring the work once the
      psock refcount hits zero and thus the socket is being destroyed.
      
      The problem is that the only case where we kfree() the list entry is
      case iv), which at that point should have an empty list in
      normal cases. So in cases i) to iii) we unlink the elements
      without freeing them, and they go out of reach from us. Hence the fix
      is to properly kfree() them as well to stop the leakage. Given these
      are all handled under psock->maps_lock, there is no need for deferred
      RCU freeing.
      
      I later also ran with the kmemleak detector, and it confirmed the
      finding: before the fix the object goes unreferenced, while after
      the patch no kmemleak report related to BPF showed up.
      
        [...]
        unreferenced object 0xffff880378eadae0 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  PMu]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
            [<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
            [<000000002ef89e83>] 0xffffffffffffffff
        unreferenced object 0xffff880378ead240 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  .Du]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
            [<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
            [<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
            [<0000000000763660>] do_syscall_64+0x9a/0x300
            [<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
            [<000000002ef89e83>] 0xffffffffffffffff
        [...]
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Fixes: 54fedb42 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
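      A sketch of the unlink-and-free pattern (simplified field and lock
      names): under psock->maps_lock nothing else can reach the entry once
      it is off the list, so a plain kfree() suffices:

        struct smap_psock_map_entry *e, *tmp;

        spin_lock_bh(&psock->maps_lock);
        list_for_each_entry_safe(e, tmp, &psock->maps, list) {
                if (e->entry == entry) {
                        list_del(&e->list);
                        kfree(e);          /* previously leaked */
                }
        }
        spin_unlock_bh(&psock->maps_lock);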
  4. 13 Aug 2018, 1 commit
    • bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Committed by Daniel Borkmann
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") and 699c86d6 ("bpf: btf: add pretty
      print for hash/lru_hash maps") enabled support for BTF and
      dumping via BPF fs for array and hash/lru maps. However, the two
      can be decoupled from each other such that regular BPF maps
      can be supported for attaching BTF key/value information,
      while not all maps necessarily need to dump via the
      map_seq_show_elem() callback.
      
      The basic sanity check, which is a prerequisite for all maps,
      is that key/value size has to match in any case, and some maps
      can have extra checks via the map_check_btf() callback, e.g.
      probing certain types or indicating no support in general. With
      that we can also enable retrieving BTF info for per-cpu map
      types and LPM.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
  5. 09 Aug 2018, 2 commits
  6. 18 Jul 2018, 1 commit
  7. 08 Jul 2018, 5 commits
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      Committed by John Fastabend
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety. Then when v4.14 was merged into the v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      when it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here, because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later, this has the potential to create some subtle issues. But even
      more concretely, the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset, we get 40 in
      either case and everything continues to work. Fix this mess and use the
      correct structures; it's unclear how long this would have kept working
      until someone moved the structs around.
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
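      As a quick check of the offset coincidence described above, a
      standalone C toy (assumed, simplified layouts mirroring the pahole
      output, on an LP64 target) reproduces the shared byte offset 40:

        #include <stddef.h>
        #include <stdio.h>

        struct qdisc_skb_cb_toy { unsigned char data[28]; };

        struct bpf_skb_data_end_toy {
                struct qdisc_skb_cb_toy qdisc_cb;  /* 0..27, then 4-byte hole */
                void *data_meta;                   /* 32 */
                void *data_end;                    /* 40 */
        };

        struct tcp_skb_cb_toy {
                unsigned char pad[24];
                struct {
                        unsigned int flags;        /* 24, then 4-byte hole */
                        void *sk_redir;            /* 32 */
                        void *data_end;            /* 40 */
                } bpf;
        };

        int main(void)
        {
                /* both print 40, which is the only reason the wrong cb
                 * mapping appeared to work */
                printf("%zu %zu\n",
                       offsetof(struct bpf_skb_data_end_toy, data_end),
                       offsetof(struct tcp_skb_cb_toy, bpf.data_end));
                return 0;
        }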
    • bpf: sockmap, consume_skb in close path · 7ebc14d5
      Committed by John Fastabend
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      Committed by John Fastabend
      After the latest lock updates there is no longer anything preventing a
      close and recvmsg call from running in parallel. Additionally, we can
      race update with close if we close a socket and simultaneously update
      it via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this, take sock_lock in the close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
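      A minimal sketch of the close-path half of the fix (simplified): take
      the sock lock before walking the psock's map linkage, mirroring how
      the cgroup ops already run with the lock held:

        static void bpf_tcp_close_sketch(struct sock *sk, long timeout)
        {
                lock_sock(sk);          /* serializes against update paths */
                rcu_read_lock();
                /* ... pop psock->maps entries and drop references ... */
                rcu_read_unlock();
                release_sock(sk);
                /* ... then invoke the saved original close callback ... */
        }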
    • bpf: sockmap, hash table is RCU so readers do not need locks · 1d1ef005
      Committed by John Fastabend
      This removes locking from readers of the RCU hash table. It's not
      necessary.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sockmap, error path can not release psock in multi-map case · 547b3aa4
      Committed by John Fastabend
      The current code, in the error path of sock_hash_ctx_update_elem(),
      checks if the sock has a psock in the user data and, if so, decrements
      the reference count of the psock. However, if the error happens early
      in the error path, we may have never incremented the psock reference
      count, and if the psock exists because the sock is in another map then
      we may inadvertently decrement the reference count.
      
      Fix this by making the error path only call smap_release_sock if the
      error happens after the increment.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
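      One way to express the rule (a sketch, not the literal patch): track
      whether this update path took its own reference, and only unwind what
      was actually taken:

        bool psock_ref_taken = false;   /* hypothetical flag */

        /* ... later, once this map actually attaches to the psock ... */
        refcount_inc(&psock->refcnt);
        psock_ref_taken = true;

        /* ... error path ... */
        if (psock_ref_taken)
                smap_release_sock(psock, sock);  /* ours to drop */
        return err;                              /* else: another map's ref, leave it */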
  8. 01 Jul 2018, 4 commits
    • bpf: sockhash, add release routine · caac76a5
      Committed by John Fastabend
      Add the map_release_uref pointer to the hashmap ops. This was dropped
      when the original sockhash code was ported into bpf-next before the
      initial commit.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockhash fix omitted bucket lock in sock_close · e9db4ef6
      Committed by John Fastabend
      First, the sk_callback_lock() was being used to protect both the
      sock callback hooks and the psock->maps list. This got overly
      convoluted after the addition of sockhash (in sockmap it made
      some sense because maps and callbacks were tightly coupled), so
      let's split out a specific lock for maps and only use the callback
      lock for its intended purpose. This fixes a couple of cases where
      we missed using the maps lock when it was in fact needed. Also this
      makes it easier to follow the code because now we can put the
      locking closer to the actual code it's serializing.
      
      Next, in sock_hash_delete_elem() the pattern was as follows,
      
        sock_hash_delete_elem()
           [...]
           spin_lock(bucket_lock)
           l = lookup_elem_raw()
           if (l)
              hlist_del_rcu()
              write_lock(sk_callback_lock)
               .... destroy psock ...
              write_unlock(sk_callback_lock)
           spin_unlock(bucket_lock)
      
      The ordering is necessary because we only know the {p}sock after
      dereferencing the hash table which we can't do unless we have the
      bucket lock held. Once we have the bucket lock and the psock element
      it is deleted from the hashmap to ensure any other path doing a lookup
      will fail. Finally, the refcnt is decremented and if zero the psock
      is destroyed.
      
      In parallel with the above (or free'ing the map), a tcp close event
      may trigger tcp_close(), which at the moment omits the bucket lock
      altogether (oops!). The flow looks like this,
      
        bpf_tcp_close()
           [...]
           write_lock(sk_callback_lock)
           for each psock->maps // list of maps this sock is part of
               hlist_del_rcu(ref_hash_node);
               .... destroy psock ...
           write_unlock(sk_callback_lock)
      
      Obviously, and demonstrated by syzbot, this is broken because
      we can have multiple threads deleting entries via hlist_del_rcu().
      
      To fix this we might be tempted to wrap the hlist operation in a
      bucket lock but that would create a lock inversion problem. In
      summary to follow locking rules the psocks maps list needs the
      sk_callback_lock (after this patch maps_lock) but we need the bucket
      lock to do the hlist_del_rcu.
      
      To resolve the lock inversion problem, pop the head of the maps list
      repeatedly and remove the reference until no more are left. If a
      delete happens in parallel from the BPF API that is OK as well, because
      it will do a similar action: look up the sock in the map/hash, delete
      it from the map/hash, and dec the refcnt. We check for this case
      before doing a destroy on the psock to ensure we don't have two
      threads tearing down a psock. The new logic is as follows,
      
        bpf_tcp_close()
        e = psock_map_pop(psock->maps) // done with map lock
        bucket_lock() // lock hash list bucket
        l = lookup_elem_raw(head, hash, key, key_size);
        if (l) {
            // only get here if the element was not already removed
           hlist_del_rcu()
           ... destroy psock...
        }
        bucket_unlock()
      
      And finally, for all the above to work, add missing locking around map
      operations per the above. Then add RCU annotations and use
      rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
      that the object is not free'd from sock_hash_free() while it is being
      referenced in bpf_tcp_close().
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, fix smap_list_map_remove when psock is in many maps · 54fedb42
      Committed by John Fastabend
      If a hashmap is free'd with open socks it removes the reference to
      the hash entry from the psock. If that is the last reference to the
      psock then it will also be free'd by the reference counting logic.
      However, the current logic that removes the hash reference from the
      list of references is broken. In smap_list_remove() we first check
      if the sockmap entry matches and then check if the hashmap entry
      matches. But the sockmap entry will always match, because it is NULL in
      this case, which causes the first entry to be removed from the list.
      If this is always the "right" entry (because the user adds/removes
      entries in order) then everything is OK, but otherwise a subsequent
      bpf_tcp_close() may reference a free'd object.
      
      To fix this create two list handlers one for sockmap and one for
      sockhash.
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
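      A sketch of the split helpers (simplified fields): each one matches on
      the member that identifies its own map type, so a NULL sock array slot
      can no longer accidentally match a hash entry's list element:

        static void smap_list_map_remove(struct smap_psock *psock,
                                         struct sock **entry)
        {
                struct smap_psock_map_entry *e, *tmp;

                list_for_each_entry_safe(e, tmp, &psock->maps, list)
                        if (e->entry == entry)          /* array slot only */
                                list_del(&e->list);
        }

        static void smap_list_hash_remove(struct smap_psock *psock,
                                          struct htab_elem *hash_link)
        {
                struct smap_psock_map_entry *e, *tmp;

                list_for_each_entry_safe(e, tmp, &psock->maps, list)
                        if (e->hash_link == hash_link)  /* hash elem only */
                                list_del(&e->list);
        }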
    • bpf: sockmap, fix crash when ipv6 sock is added · 9901c5d7
      Committed by John Fastabend
      This fixes a crash where we assign tcp_prot to IPv6 sockets instead
      of tcpv6_prot.
      
      Previously we overwrote the sk->sk_prot field with tcp_prot even in the
      AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
      are used.
      
      Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
      'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
      crashing case here.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
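      A rough sketch of the family selection (bpf_tcp_prot / bpf_tcpv6_prot
      are placeholder names for the per-family proto templates, not the
      patch's identifiers):

        /* only established TCP socks, and pick the right base proto */
        if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_ESTABLISHED)
                return -EOPNOTSUPP;

        psock->sk_proto = sk->sk_prot;   /* saved for restore on close */
        sk->sk_prot = (sk->sk_family == AF_INET6) ? &bpf_tcpv6_prot
                                                  : &bpf_tcp_prot;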
  9. 26 Jun 2018, 1 commit
    • bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF · fdb5c453
      Committed by Sean Young
      If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
      possible to attach, detach or query IR BPF programs to /dev/lircN devices,
      making them impossible to use. For embedded devices, it should be possible
      to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
      
      This change requires some refactoring, since the bpf_prog_{attach,detach,query}
      functions are now always compiled, but their code paths for cgroups need
      moving out. Rather than an #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
      moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
      require #ifdefs since those files are already conditionally compiled.
      
      Fixes: f4364dcf ("media: rc: introduce BPF_PROG_LIRC_MODE2")
      Signed-off-by: Sean Young <sean@mess.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  10. 19 May 2018, 1 commit
  11. 18 May 2018, 4 commits
    • bpf: parse and verdict prog attach may race with bpf map update · 96174560
      Committed by John Fastabend
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use bpf_prog_inc_not_zero and a READ_ONCE()
      primitive to ensure the program pointer is not refetched and
      possibly NULL'd before the refcnt increment. This happens inside
      an RCU critical section, so although the pointer reference in the map
      object may be NULL (by a concurrent detach operation), the reference
      from READ_ONCE will not be free'd until after the grace period. This
      ensures the object returned by READ_ONCE() is valid through the
      RCU critical section and safe to use, as long as we "know" it may
      be free'd shortly.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is that the logic checks the object returned from the READ_ONCE() is
      not NULL and then tries to reference the object again, but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened, bpf_prog_inc_not_zero could
      dereference a NULL pointer.
      
      Fix this by using the variable returned by READ_ONCE() that is checked
      for NULL.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
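      The safe pattern, sketched: fetch once, test the fetched pointer, and
      bump the refcount on that same pointer; never re-read the map field:

        struct bpf_prog *prog = READ_ONCE(stab->bpf_parse);

        if (prog) {
                /* inc on the pointer we checked, not on a fresh (and
                 * possibly NULL'd) re-read of stab->bpf_parse */
                prog = bpf_prog_inc_not_zero(prog);
                if (IS_ERR(prog))
                        return PTR_ERR(prog);
        }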
    • bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      Committed by John Fastabend
      If the user were to only attach one of the parse or verdict programs,
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because, in the
      rollback logic after an error, we have to decrement the program
      reference count when it has been incremented. However, we only
      increment the program reference count if the user has both a verdict
      and a parse program. The reason for this is that, at least at the
      moment, both are required for either one to be meaningful. The problem
      fixed here is that in the rollback path we decrement the program
      refcnt even if only one exists, but we never incremented the refcnt in
      the first place, creating an imbalance.
      
      This patch fixes the error path to handle this case.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, fix double-free · a7862293
      Committed by Gustavo A. R. Silva
      `e' is being freed twice.
      
      Fix this by removing one of the kfree() calls.
      
      Addresses-Coverity-ID: 1468983 ("Double free")
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, fix uninitialized variable · 0e436456
      Committed by Gustavo A. R. Silva
      There is a potential execution path in which the variable err is
      returned without having been properly initialized.
      
      Fix this by initializing the variable err to 0.
      
      Addresses-Coverity-ID: 1468964 ("Uninitialized scalar variable")
      Fixes: e5cd3abc ("bpf: sockmap, refactor sockmap routines to work with hashmap")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  12. 17 May 2018, 2 commits
    • bpf: sockmap, on update propagate errors back to userspace · e23afe5e
      Committed by John Fastabend
      When an error happens in the update sockmap element logic also pass
      the err up to the user.
      
      Fixes: e5cd3abc ("bpf: sockmap, refactor sockmap routines to work with hashmap")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: fix sock hashmap kmalloc warning · 683d2ac3
      Committed by Yonghong Song
      syzbot reported a kernel warning below:
        WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
        Kernel panic - not syncing: panic_on_warn set ...
      
        CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1b9/0x294 lib/dump_stack.c:113
         panic+0x22f/0x4de kernel/panic.c:184
         __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
         report_bug+0x252/0x2d0 lib/bug.c:186
         fixup_bug arch/x86/kernel/traps.c:178 [inline]
         do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
         do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
         invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
        RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
        RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
        RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
        R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
        R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
         __do_kmalloc mm/slab.c:3713 [inline]
         __kmalloc+0x25/0x760 mm/slab.c:3727
         kmalloc include/linux/slab.h:517 [inline]
         map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
         __do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
         __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
         do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The test case is against the sock hashmap with a key size of 0xffffffe1.
      Such a large key size will cause the below code in function
      sock_hash_alloc() to overflow and produce a smaller elem_size,
      hence map creation will be successful.
          htab->elem_size = sizeof(struct htab_elem) +
                            round_up(htab->map.key_size, 8);
      
      Later, when map_get_next_key is called and the kernel tries
      and fails to allocate the key, it will issue
      the above warning.
      
      Similar to hashtab, ensure the key size is at most
      MAX_BPF_STACK for a successful map creation.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
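      The guard, sketched (the bound matches what the message above
      describes for hashtab):

        /* reject keys large enough to overflow elem_size; hashtab
         * applies the same MAX_BPF_STACK (512 byte) limit */
        if (attr->key_size > MAX_BPF_STACK)
                return ERR_PTR(-E2BIG);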
  13. 16 May 2018, 1 commit
  14. 15 May 2018, 1 commit
    • bpf: sockmap, refactor sockmap routines to work with hashmap · e5cd3abc
      Committed by John Fastabend
      This patch only refactors the existing sockmap code. This will allow
      much of the psock initialization code path and BPF helper code to
      work for both sockmap BPF map types: the array-backed type that is
      currently supported, and the new hash-backed map type,
      sockhash.
      
      Most of the fallout comes from three changes,
      
        - Pushing bpf programs into an independent structure so we
          can use it from the htab struct in the next patch.
        - Generalizing helpers to use void *key instead of the hardcoded
          u32.
        - Instead of passing map/key through the metadata we now do
          the lookup inline. This avoids storing the key in the metadata
          which will be useful when keys can be longer than 4 bytes. We
          rename the sk pointers to sk_redir at this point as well to
          avoid any confusion between the current sk pointer and the
          redirect pointer sk_redir.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  15. 03 May 2018, 3 commits
    • bpf: sockmap, fix error handling in redirect failures · abaeb096
      Committed by John Fastabend
      When a redirect failure happens, we release the in-flight buffers
      without calling sk_mem_uncharge(); the uncharge is called before
      dropping the sock lock for the redirected socket. However, we missed
      updating the ring start index. When no apply actions are in progress
      this is OK, because we uncharge the entire buffer before the redirect.
      But when we have apply logic running, it's possible that only a
      portion of the buffer is being redirected. In this case we only
      do memory accounting for the buffer slice being redirected and
      expect to be able to loop over the BPF program again and/or, if
      a sock is closed, uncharge the memory at sock destruct time.
      
      With an invalid start index, however, the program logic looks at
      the start pointer index, checks the length, and when seeing the
      length is zero (from the initial release and the failure to update
      the pointer) aborts without uncharging/releasing the remaining
      memory.
      
      The fix for this is simply to update the start index. To avoid
      fixing this error in two locations we do a small refactor and
      remove one case where it is open-coded. Then fix it in the
      single function.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
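      A sketch of a release helper with the missing index update (the name
      and ring handling are simplified stand-ins; MAX_SKB_FRAGS is the
      assumed ring size):

        static void free_bytes_sg_sketch(struct sock *sk, int bytes,
                                         struct sk_msg_buff *md)
        {
                struct scatterlist *sg = md->sg_data;
                int i = md->sg_start, free;

                while (bytes && sg[i].length) {
                        free = sg[i].length;
                        if (bytes < free) {          /* partial slice */
                                sg[i].length -= bytes;
                                sg[i].offset += bytes;
                                sk_mem_uncharge(sk, bytes);
                                break;
                        }
                        sk_mem_uncharge(sk, sg[i].length);
                        put_page(sg_page(&sg[i]));
                        sg[i].length = 0;
                        bytes -= free;
                        if (++i == MAX_SKB_FRAGS)    /* ring wrap */
                                i = 0;
                }
                md->sg_start = i;   /* the previously missing update */
        }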
    • bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      Committed by John Fastabend
      When an error occurs during a redirect we have two cases that need
      to be handled (i) we have a cork'ed buffer (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simply consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user, but this left a gap where the psock and
      sk_msg_buff states were out of sync, which seems fragile. No additional
      overhead is taken here except for a call to check the length and
      realize it's already been freed. This is in the error path as well, so
      in my opinion let's have robust code over optimized error paths.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      Committed by John Fastabend
      When the call to do_tcp_sendpage() fails to send the complete block
      requested, we either retry if only a partial send was completed, or
      abort if we receive an error less than or equal to zero. Before
      returning, though, we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is torn down, because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The simple fix is to simply do the accounting inside the retry loop
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
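      A sketch of the retry loop with per-iteration accounting: the
      scatterlist entry is adjusted by exactly what each call consumed, so
      an abort at any point leaves offset/length consistent:

        while (sg->length) {
                int ret = do_tcp_sendpages(sk, sg_page(sg), sg->offset,
                                           sg->length, flags);
                if (ret <= 0)
                        return ret;     /* sg already reflects partial sends */

                sk_mem_uncharge(sk, ret);
                sg->offset += ret;      /* absolute update, inside the loop */
                sg->length -= ret;
        }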
  16. 24 Apr 2018, 3 commits
    • bpf: sockmap, fix double page_put on ENOMEM error in redirect path · 4fcfdfb8
      Committed by John Fastabend
      In the case where the socket memory boundary is hit, the redirect
      path returns an ENOMEM error. However, before checking for this
      condition the redirect scatterlist buffer is set up with a valid
      page and length. This is never unwound, so when the buffers are
      released later in the error path we do a put_page() and clear
      the scatterlist fields. But, because the initial error happens
      before completing the scatterlist buffer, we end up with both the
      original buffer and the redirect buffer pointing to the same page,
      resulting in duplicate put_page() calls.
      
      To fix this simply move the initial configuration of the redirect
      scatterlist buffer below the sock memory check.
      
      Found this while running TCP_STREAM test with netperf using Cilium.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
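      The ordering, sketched (helper usage assumed for illustration):
      perform the memory check first, and only then publish the page into
      the redirect scatterlist, so a failing check leaves nothing for the
      error path to double-free:

        if (!sk_wmem_schedule(sk, send)) {
                ret = -ENOMEM;          /* redirect sg still untouched */
                goto out;
        }
        /* only now alias the page into the redirect buffer */
        sg_set_page(sg, page, send, offset);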
    • bpf: sockmap, sk_wait_event needed to handle blocking cases · e20f7334
      Committed by John Fastabend
      In the recvmsg handler we need to add a wait event to support the
      blocking use cases. Without this we return zero and may confuse
      user applications. In the wait event any data received on the
      sk either via sk_receive_queue or the psock ingress list will
      wake up the sock.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      Committed by John Fastabend
      Relying on the map_release hook to decrement the reference counts when
      a map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this, BPF programs may not be in use, which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. 21 Apr 2018, 1 commit
  18. 04 Apr 2018, 1 commit