  1. 03 Dec, 2020 2 commits
  2. 16 Oct, 2020 2 commits
  3. 12 Oct, 2020 2 commits
  4. 01 Oct, 2020 1 commit
    • bpf, net: Rework cookie generator as per-cpu one · 92acdc58
      Committed by Daniel Borkmann
      With its use in BPF, the cookie generator can be called very frequently,
      in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
      and attached to the root cgroup, for example, when used in v1/v2 mixed
      environments. Especially when there is high socket churn in the
      system, there can be many parallel requests to the bpf_get_socket_cookie()
      and bpf_get_netns_cookie() helpers, which then cause contention on the
      atomic counter.
      
      As similarly done in f991bd2e ("fs: introduce a per-cpu last_ino
      allocator"), add a small helper library that both can use for the 64 bit
      counters. Given this can be called from different contexts, we also need
      to deal with potential nested calls even though in practice they are
      considered extremely rare. One idea, suggested by Eric Dumazet, was
      to use a reverse counter for this situation since we don't expect 64 bit
      overflows anyway; that way, we can avoid bigger gaps in the 64 bit
      counter space compared to just a batch-wise increase. Even on machines
      with a small number of cores (e.g. 4), cookie generation time shrinks from
      min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
      in parallel from multiple CPUs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
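The batching idea described in the message above can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel code: `cookie_slot`, `cookie_gen`, `cookie_next` and the batch size of 4096 are hypothetical names and values, and a plain struct stands in for real per-CPU data. Each slot hands out cookies from a locally reserved batch and touches the shared forward counter only when the batch runs out; a nested call on a busy slot takes its value from a shared counter that runs downward from the top of the 64 bit space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed batch size for this sketch; must be a power of two. */
#define COOKIE_BATCH 4096u

/* Stand-in for per-CPU state: one instance per CPU (or per thread). */
struct cookie_slot {
    int nesting;    /* non-zero while a call is already running on this slot */
    uint64_t last;  /* last cookie handed out from the local batch */
};

/* Shared state, touched only on batch refill or on a nested call. */
struct cookie_gen {
    _Atomic uint64_t forward_last;  /* grows in steps of COOKIE_BATCH */
    _Atomic uint64_t reverse_last;  /* counts down from UINT64_MAX */
};

static void cookie_gen_init(struct cookie_gen *gc)
{
    atomic_init(&gc->forward_last, 0);
    atomic_init(&gc->reverse_last, UINT64_MAX);
}

static uint64_t cookie_next(struct cookie_gen *gc, struct cookie_slot *slot)
{
    uint64_t val;

    assert(slot->nesting >= 0);
    if (++slot->nesting == 1) {
        /* Common case: serve from the local batch without shared writes. */
        val = slot->last;
        if ((val & (COOKIE_BATCH - 1)) == 0) {
            /* Batch exhausted (or first use): reserve the next batch. */
            val = atomic_fetch_add(&gc->forward_last, COOKIE_BATCH);
        }
        slot->last = ++val;
    } else {
        /* Nested call: leave the local batch alone, use the reverse counter. */
        val = atomic_fetch_sub(&gc->reverse_last, 1) - 1;
    }
    slot->nesting--;
    return val;
}
```

In this sketch a slot is only safe when it is used by one thread at a time, which mirrors the per-CPU assumption. The nesting counter models a call that interrupts another call on the same CPU.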
  5. 29 Sep, 2020 1 commit
  6. 11 Sep, 2020 2 commits
  7. 28 Aug, 2020 1 commit
    • bpf: Add map_meta_equal map ops · f4d05259
      Committed by Martin KaFai Lau
      Some properties of the inner map are used at verification time.
      When an inner map is inserted into an outer map at runtime,
      bpf_map_meta_equal() is currently used to ensure that those properties
      of the inserted inner map stay the same as at verification
      time.
      
      In particular, the current bpf_map_meta_equal() checks max_entries, which
      turns out to be too restrictive for most maps, since they do not use
      max_entries at verification time.  It rules out the use case that
      wants to replace a smaller inner map with a larger inner map.  Some
      maps do use max_entries during verification, though.  For example,
      the map_gen_lookup in array_map_ops uses max_entries to generate
      the inline lookup code.
      
      To accommodate differences between maps, the map_meta_equal is added
      to bpf_map_ops.  Each map-type can decide what to check when its
      map is used as an inner map during runtime.
      
      Also, some map types cannot be used as an inner map and they are
      currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
      It is not unusual for authors of new map types to be unaware that such
      a blacklist exists.  This patch enforces an explicit opt-in
      and only allows a map to be used as an inner map if it has
      implemented the map_meta_equal ops.  It is based on the
      discussion in [1].
      
      All maps that support being an inner map have their map_meta_equal
      pointing to bpf_map_meta_equal in this patch.  A later patch will
      relax the max_entries check for most maps.  bpf_types.h
      counts 28 map types.  This patch adds 23 ".map_meta_equal"
      by using coccinelle.  -5 for
      	BPF_MAP_TYPE_PROG_ARRAY
      	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
      	BPF_MAP_TYPE_STRUCT_OPS
      	BPF_MAP_TYPE_ARRAY_OF_MAPS
      	BPF_MAP_TYPE_HASH_OF_MAPS
      
      The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
      is moved such that the same error is returned.
      
      [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/
      
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
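The opt-in scheme described above can be sketched as a per-type callback in plain C. This is a simplified illustration, not the kernel's data structures: `struct map`, `struct map_ops`, `inner_map_check` and the return codes are hypothetical. The relaxed comparison shows what the message says a later patch will do for most maps, so treat it as an assumption about that follow-up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct map;

struct map_ops {
    /* NULL means this map type has not opted in as an inner map. */
    bool (*meta_equal)(const struct map *meta, const struct map *inner);
};

struct map {
    const struct map_ops *ops;
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;
};

/* Hypothetical result codes for this sketch. */
enum { INNER_OK = 0, INNER_NOT_SUPPORTED = -1, INNER_MISMATCH = -2 };

/* Relaxed check: ignores max_entries. */
static bool meta_equal_relaxed(const struct map *meta, const struct map *inner)
{
    return meta->ops == inner->ops &&
           meta->key_size == inner->key_size &&
           meta->value_size == inner->value_size;
}

/* Strict check: for types whose verification depends on max_entries. */
static bool meta_equal_strict(const struct map *meta, const struct map *inner)
{
    return meta_equal_relaxed(meta, inner) &&
           meta->max_entries == inner->max_entries;
}

static const struct map_ops array_like_ops = { .meta_equal = meta_equal_strict };
static const struct map_ops hash_like_ops = { .meta_equal = meta_equal_relaxed };
static const struct map_ops prog_array_like_ops = { .meta_equal = NULL };

/* Decide whether `inner` may replace the map described by `meta`. */
static int inner_map_check(const struct map *meta, const struct map *inner)
{
    assert(meta != NULL && inner != NULL);
    if (!meta->ops->meta_equal || !inner->ops->meta_equal)
        return INNER_NOT_SUPPORTED;  /* explicit opt-in is missing */
    if (!meta->ops->meta_equal(meta, inner))
        return INNER_MISMATCH;
    return INNER_OK;
}
```

With a callback per type, a new map type that never sets `meta_equal` is rejected by default, so nobody has to remember to extend a blacklist.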
  8. 22 Aug, 2020 4 commits
  9. 01 Jul, 2020 2 commits
  10. 23 Jun, 2020 2 commits
  11. 13 Jun, 2020 2 commits
  12. 10 Jun, 2020 2 commits
    • bpf, sockhash: Synchronize delete from bucket list on map free · 75e68e5b
      Committed by Jakub Sitnicki
      We can end up modifying the sockhash bucket list from two CPUs when a
      sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
      that is in the sockhash is unlinking itself from it on another CPU
      (sock_hash_delete_from_link).
      
      This results in accessing a list element that is in an undefined state as
      reported by KASAN:
      
      | ==================================================================
      | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
      | Write of size 8 at addr dead000000000122 by task kworker/2:1/95
      |
      | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      | Workqueue: events bpf_map_free_deferred
      | Call Trace:
      |  dump_stack+0x97/0xe0
      |  ? sock_hash_free+0x13c/0x280
      |  __kasan_report.cold+0x5/0x40
      |  ? mark_lock+0xbc1/0xc00
      |  ? sock_hash_free+0x13c/0x280
      |  kasan_report+0x38/0x50
      |  ? sock_hash_free+0x152/0x280
      |  sock_hash_free+0x13c/0x280
      |  bpf_map_free_deferred+0xb2/0xd0
      |  ? bpf_map_charge_finish+0x50/0x50
      |  ? rcu_read_lock_sched_held+0x81/0xb0
      |  ? rcu_read_lock_bh_held+0x90/0x90
      |  process_one_work+0x59a/0xac0
      |  ? lock_release+0x3b0/0x3b0
      |  ? pwq_dec_nr_in_flight+0x110/0x110
      |  ? rwlock_bug.part.0+0x60/0x60
      |  worker_thread+0x7a/0x680
      |  ? _raw_spin_unlock_irqrestore+0x4c/0x60
      |  kthread+0x1cc/0x220
      |  ? process_one_work+0xac0/0xac0
      |  ? kthread_create_on_node+0xa0/0xa0
      |  ret_from_fork+0x24/0x30
      | ==================================================================
      
      Fix it by reintroducing a spin-lock protected critical section around the
      code that removes the elements from the bucket on sockhash free.
      
      To do that we also need to defer processing of removed elements, until out
      of atomic context so that we can unlink the socket from the map when
      holding the sock lock.
      
      Fixes: 90db6d77 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com
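The two-phase pattern described in the message above (unlink under the spin lock, process later outside atomic context) can be sketched in user-space C. This is a rough illustration, not the kernel code: `struct bucket`, `bucket_detach_all`, `process_detached` and the `sock_unlink` stub are hypothetical, and a C11 `atomic_flag` stands in for the bucket spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct elem {
    struct elem *next;
    int sock_id;            /* stand-in for the linked socket */
};

struct bucket {
    atomic_flag lock;       /* minimal spin lock */
    struct elem *head;
};

static int socks_unlinked;  /* demo counter, bumped by the stub below */

/* Stub for work that may sleep, such as taking the socket lock. */
static void sock_unlink(int sock_id)
{
    (void)sock_id;
    socks_unlinked++;
}

static void bucket_init(struct bucket *b)
{
    atomic_flag_clear(&b->lock);
    b->head = NULL;
}

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;                   /* spin until the flag was previously clear */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static int bucket_push(struct bucket *b, int sock_id)
{
    struct elem *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->sock_id = sock_id;
    bucket_lock(b);
    e->next = b->head;
    b->head = e;
    bucket_unlock(b);
    return 0;
}

/* Phase 1: inside the critical section, only move elements to a private list. */
static struct elem *bucket_detach_all(struct bucket *b)
{
    struct elem *list;

    bucket_lock(b);
    list = b->head;
    b->head = NULL;
    bucket_unlock(b);
    return list;
}

/* Phase 2: outside the lock, do the sleeping work and free each element. */
static int process_detached(struct elem *list)
{
    int n = 0;

    while (list) {
        struct elem *next = list->next;

        sock_unlink(list->sock_id);
        free(list);
        list = next;
        n++;
    }
    return n;
}
```

Because the elements already sit on a private list when phase 2 runs, a concurrent deleter that takes the bucket lock no longer finds them, and the sleeping work never happens while the spin lock is held.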
    • bpf, sockhash: Fix memory leak when unlinking sockets in sock_hash_free · 33a7c831
      Committed by Jakub Sitnicki
      When sockhash gets destroyed while sockets are still linked to it, we will
      walk the bucket lists and delete the links. However, we are not freeing the
      list elements after processing them, leaking the memory.
      
      The leak can be triggered by close()'ing a sockhash map when it still
      contains sockets, and observed with kmemleak:
      
        unreferenced object 0xffff888116e86f00 (size 64):
          comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
          hex dump (first 32 bytes):
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
            81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff  ...A.....i/.....
          backtrace:
            [<00000000dd089ebb>] sock_hash_update_common+0x4ca/0x760
            [<00000000b8219bd5>] sock_hash_update_elem+0x1d2/0x200
            [<000000005e2c23de>] __do_sys_bpf+0x2046/0x2990
            [<00000000d0084618>] do_syscall_64+0xad/0x9a0
            [<000000000d96f263>] entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      Fix it by freeing the list element when we're done with it.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com
  13. 30 Apr, 2020 1 commit
  14. 11 Mar, 2020 1 commit
    • bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free · 90db6d77
      Committed by John Fastabend
      The bucket->lock is not needed in the sock_hash_free and sock_map_free
      calls; in fact, it causes a splat due to being inside an RCU block.
      
      | BUG: sleeping function called from invalid context at net/core/sock.c:2935
      | in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
      | 3 locks held by kworker/0:1/62:
      |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
      | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      | Workqueue: events bpf_map_free_deferred
      | Call Trace:
      |  dump_stack+0x71/0xa0
      |  ___might_sleep.cold+0xa6/0xb6
      |  lock_sock_nested+0x28/0x90
      |  sock_map_free+0x5f/0x180
      |  bpf_map_free_deferred+0x58/0x80
      |  process_one_work+0x260/0x5e0
      |  worker_thread+0x4d/0x3e0
      |  kthread+0x108/0x140
      |  ? process_one_work+0x5e0/0x5e0
      |  ? kthread_park+0x90/0x90
      |  ret_from_fork+0x3a/0x50
      
      The reason we have stab->lock and bucket->locks in the sockmap code is to
      handle checking EEXIST in update/delete cases. We need to be careful during
      an update operation that we check for EEXIST, and we need to ensure that the
      psock object is not in some partial state of removal/insertion while we do
      this. So both map_update_common and sock_map_delete need to be guarded against
      running together and potentially deleting an entry we are checking, etc. But by
      the time we get to the tear-down code in sock_{map|hash}_free we have already
      disconnected the map and we just did synchronize_rcu() in the line above, so
      no updates/deletes should be in flight. Because of this we can drop the
      bucket locks from the map free'ing code.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/158385850787.30597.8346421465837046618.stgit@john-Precision-5820-Tower
  15. 10 Mar, 2020 5 commits
  16. 22 Feb, 2020 4 commits
  17. 18 Feb, 2020 1 commit
  18. 08 Feb, 2020 3 commits
    • bpf, sockhash: Synchronize_rcu before free'ing map · 0b2dc839
      Committed by Jakub Sitnicki
      We need to have a synchronize_rcu before free'ing the sockhash because any
      outstanding psock references will have a pointer to the map and when they
      use it, this could trigger a use after free.
      
      This is a sister fix for sockhash, following commit 2bb90e5c ("bpf:
      sockmap, synchronize_rcu before free'ing map") which addressed sockmap,
      which comes from a manual audit.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200206111652.694507-3-jakub@cloudflare.com
    • bpf, sockmap: Don't sleep while holding RCU lock on tear-down · db6a5018
      Committed by Jakub Sitnicki
      rcu_read_lock is needed to protect access to psock inside sock_map_unref
      when tearing down the map. However, we can't afford to sleep in lock_sock
      while in an RCU read-side critical section. Grab the RCU lock only after we
      have locked the socket.
      
      This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
      sockmap/sockhash that contains at least one socket:
      
      | =============================
      | WARNING: suspicious RCU usage
      | 5.5.0-04005-g8fc91b97 #450 Not tainted
      | -----------------------------
      | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
      |
      | other info that might help us debug this:
      |
      |
      | rcu_scheduler_active = 2, debug_locks = 1
      | 4 locks held by kworker/0:1/62:
      |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
      |  #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
      |
      | stack backtrace:
      | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b97 #450
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      | Workqueue: events bpf_map_free_deferred
      | Call Trace:
      |  dump_stack+0x71/0xa0
      |  ___might_sleep+0x105/0x190
      |  lock_sock_nested+0x28/0x90
      |  sock_map_free+0x95/0x170
      |  bpf_map_free_deferred+0x58/0x80
      |  process_one_work+0x260/0x5e0
      |  worker_thread+0x4d/0x3e0
      |  kthread+0x108/0x140
      |  ? process_one_work+0x5e0/0x5e0
      |  ? kthread_park+0x90/0x90
      |  ret_from_fork+0x3a/0x50
      
      | =============================
      | WARNING: suspicious RCU usage
      | 5.5.0-04005-g8fc91b97-dirty #452 Not tainted
      | -----------------------------
      | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
      |
      | other info that might help us debug this:
      |
      |
      | rcu_scheduler_active = 2, debug_locks = 1
      | 4 locks held by kworker/0:1/62:
      |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
      |  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
      |  #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
      |
      | stack backtrace:
      | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b97-dirty #452
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      | Workqueue: events bpf_map_free_deferred
      | Call Trace:
      |  dump_stack+0x71/0xa0
      |  ___might_sleep+0x105/0x190
      |  lock_sock_nested+0x28/0x90
      |  sock_hash_free+0xec/0x1d0
      |  bpf_map_free_deferred+0x58/0x80
      |  process_one_work+0x260/0x5e0
      |  worker_thread+0x4d/0x3e0
      |  kthread+0x108/0x140
      |  ? process_one_work+0x5e0/0x5e0
      |  ? kthread_park+0x90/0x90
      |  ret_from_fork+0x3a/0x50
      
      Fixes: 7e81a353 ("bpf: Sockmap, ensure sock lock held during tear down")
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200206111652.694507-2-jakub@cloudflare.com
    • bpf, sockmap: Check update requirements after locking · 85b8ac01
      Committed by Lorenz Bauer
      It's currently possible to insert sockets in unexpected states into
      a sockmap, due to a TOCTTOU when updating the map from a syscall.
      sock_map_update_elem checks that sk->sk_state == TCP_ESTABLISHED,
      locks the socket and then calls sock_map_update_common. At this
      point, the socket may have transitioned into another state, and
      the earlier assumptions don't hold anymore. Crucially, it's
      conceivable (though very unlikely) that a socket has become unhashed.
      This breaks the sockmap's assumption that it will get a callback
      via sk->sk_prot->unhash.
      
      Fix this by checking the (fixed) sk_type and sk_protocol without the
      lock, followed by a locked check of sk_state.
      
      Unfortunately it's not possible to push the check down into
      sock_(map|hash)_update_common, since BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
      runs before the socket has transitioned from TCP_SYN_RECV into
      TCP_ESTABLISHED.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20200207103713.28175-1-lmb@cloudflare.com
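The check ordering described in the message above can be sketched in user-space C: fields that never change after socket creation are tested without the lock, while the mutable state is tested only once the lock is held. This is an illustration under stated assumptions, not the kernel code: `struct sock_sketch`, `map_update_from_syscall`, the constants and the return codes are hypothetical, and a C11 `atomic_flag` stands in for the socket lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_state { ST_SYN_RECV, ST_ESTABLISHED, ST_CLOSE };
enum { TYPE_STREAM = 1, TYPE_DGRAM = 2 };
enum { PROTO_TCP = 6 };

/* Hypothetical result codes for this sketch. */
enum { UPD_OK = 0, UPD_UNSUPPORTED = -1, UPD_BAD_STATE = -2 };

struct sock_sketch {
    int type;              /* fixed after creation */
    int protocol;          /* fixed after creation */
    atomic_flag lock;      /* stand-in for the socket lock */
    enum sk_state state;   /* may change until the lock is held */
    bool in_map;           /* stand-in for "inserted into the map" */
};

static void sock_sketch_init(struct sock_sketch *sk, int type, int protocol,
                             enum sk_state state)
{
    sk->type = type;
    sk->protocol = protocol;
    atomic_flag_clear(&sk->lock);
    sk->state = state;
    sk->in_map = false;
}

static void sk_lock(struct sock_sketch *sk)
{
    while (atomic_flag_test_and_set_explicit(&sk->lock, memory_order_acquire))
        ;                  /* spin until the lock is free */
}

static void sk_unlock(struct sock_sketch *sk)
{
    atomic_flag_clear_explicit(&sk->lock, memory_order_release);
}

static int map_update_from_syscall(struct sock_sketch *sk)
{
    int ret = UPD_OK;

    assert(sk != NULL);
    /* Unlocked check is fine here: these fields are immutable. */
    if (sk->type != TYPE_STREAM || sk->protocol != PROTO_TCP)
        return UPD_UNSUPPORTED;

    sk_lock(sk);
    /* The state is checked only under the lock, closing the race window. */
    if (sk->state != ST_ESTABLISHED)
        ret = UPD_BAD_STATE;
    else
        sk->in_map = true;
    sk_unlock(sk);
    return ret;
}
```

Testing the state before taking the lock would reopen the window in which the socket can leave the established state between the check and the insert.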
  19. 16 Jan, 2020 1 commit
    • bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      Committed by John Fastabend
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      
      Currently, though, we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe; we never used proper READ_ONCE/WRITE_ONCE semantics here, which we
      would have needed if we believed it was safe. Further, it's not clear to
      me it's even a good idea to try and do this on "live" sockets while the
      networking side might also be using the socket. Instead of trying to
      reason about using the socks from both sides, let's realize that every
      use case I'm aware of rarely deletes maps; in fact, the kubernetes/Cilium
      case builds the map at init and never tears it down except on errors. So
      let's do the simple fix and grab the sock lock.
      
      This patch wraps sock deletes from maps in the sock lock and adds some
      annotations so we catch any other cases more easily.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
  20. 05 Sep, 2019 1 commit