1. 24 Aug 2022, 9 commits
  2. 27 Apr 2022, 1 commit
  3. 08 Mar 2022, 8 commits
  4. 14 Jan 2022, 1 commit
  5. 07 Jan 2022, 1 commit
  6. 30 Nov 2021, 1 commit
    • bpf: Fix toctou on read-only map's constant scalar tracking · 84c51e2d
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v5.16-rc1
      commit 353050be
      category: bugfix
      bugzilla: 185802 https://gitee.com/openeuler/kernel/issues/I4DDEL
      CVE: CVE-2021-4001
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=353050be4c19e102178ccc05988101887c25ae53
      
      --------------------------------
      
      Commit a23740ec ("bpf: Track contents of read-only maps as scalars") checks
      whether maps are read-only from both the BPF program side and the user space
      side and, given their content is constant, reads out their data via
      map->ops->map_direct_value_addr(). That data is then used as the known scalar
      value for the register, that is, the register is marked via __mark_reg_known()
      with the value read at verification time. Before a23740ec, the register
      content was marked as an unknown scalar, so the verifier could not make any
      assumptions about the map content.
      
      The current implementation, however, is prone to a TOCTOU race: the value read
      as a known scalar for the register is not guaranteed to be the same at the
      later point when the program is actually executed. As a result, the
      assumptions the verifier made about the program become invalid, which can
      cause issues such as OOB access, etc.
      
      While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
      specified at map creation time, the map->frozen property is initially set to
      false because the map value still needs to be populated, e.g. for global data
      sections. Once complete, the loader "freezes" the map from user space such
      that no subsequent updates/deletes are possible anymore. For the rest of the
      map's lifetime, this one-time freeze cannot be undone after a successful
      BPF_MAP_FREEZE cmd returns. Meaning, any new BPF_* cmd calls which would
      update/delete map entries will be rejected with -EPERM since
      map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also means
      that pending update/delete operations must still complete before this
      guarantee is given. This corner case is not an issue for loaders since they
      create and prepare such program-private maps in successive steps.
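
      For illustration, a user-space sketch of that loader flow, assuming libbpf's
      bpf_map_create()/bpf_map_freeze() wrappers (buffer and map names are made up
      for the sketch):

        #include <bpf/bpf.h>

        LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_RDONLY_PROG);
        __u32 zero = 0;
        __u64 init_val = 42;                        /* illustrative constant value */
        int map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "rodata_map",
                                    sizeof(zero), sizeof(init_val), 1, &opts);

        bpf_map_update_elem(map_fd, &zero, &init_val, BPF_ANY); /* populate once  */
        bpf_map_freeze(map_fd);                                 /* one-way freeze */
        /* From here on, any further update/delete attempt fails with -EPERM. */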
      
      However, a malicious user is able to trigger this TOCTOU race in two different
      ways: i) via userfaultfd, and ii) via batched updates. For i), userfaultfd is
      used to widen the race window so that map_update_elem() can modify the
      contents of the map after map_freeze() and bpf_prog_load() have been executed.
      This works because userfaultfd halts the parallel thread which triggered a
      map_update_elem() at the point where we copy key/value from the user buffer,
      and that call already passed the FMODE_CAN_WRITE check since the map was not
      "frozen" at that time. Then, the main thread performs the map_freeze() and
      bpf_prog_load(), and once these have completed successfully, the other thread
      is woken up to complete the pending map_update_elem(), which then changes the
      map content. For ii), the idea of the batched update is similar: when there is
      a large number of updates to be processed, the race window between the two
      grows. It is therefore possible in practice to modify the contents of the map
      after executing map_freeze() and bpf_prog_load().
      
      One way to fix both i) and ii) at the same time is to expand the use of the
      map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
      support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
      Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with the
      rationale of making a writable mmap()'ing of a map mutually exclusive with
      read-only freezing. The counter indicates writable mmap() mappings and then
      prevents/fails the freeze operation. Its semantics can be expanded beyond just
      mmap() by generally indicating ongoing write phases. This essentially spans
      any parallel regular and batched flavor of update/delete operation and then
      also has map_freeze() fail with -EBUSY. For the check_mem_access() in the
      verifier, we expand upon the bpf_map_is_rdonly() check, ensuring via the
      bpf_map_write_active() test that all pending writes have completed. Only once
      map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
      are we really guaranteed to be able to use the map's data as known constants.
      If map->frozen is set but pending writes are still in the process of
      completing, we fall back to marking that register as an unknown scalar so we
      don't end up making assumptions about it. With this, both TOCTOU reproducers
      from i) and ii) are fixed.
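
      A minimal sketch of the two checks just described, as kernel-side pseudo-code
      (simplified; the exact code in the final patch may differ):

        static bool bpf_map_write_active(const struct bpf_map *map)
        {
                /* Non-zero while regular or batched update/delete ops are in flight. */
                return atomic64_read(&map->writecnt) != 0;
        }

        static bool bpf_map_is_rdonly(const struct bpf_map *map)
        {
                /* Use map data as known constants only if the map is read-only for
                 * programs, frozen from user space, and has no pending writes.
                 */
                return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen &&
                       !bpf_map_write_active(map);
        }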
      
      Note that map->writecnt has been converted into an atomic64 in the fix in
      order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
      map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning the
      freeze_mutex over entire map update/delete operations on the syscall side
      would not be possible since it would serialize everything. Similarly,
      something like synchronize_rcu() after setting map->frozen to wait for
      update/deletes to complete is not possible either, since it would also have to
      span the user copy, which can sleep. On the libbpf side, this won't break
      d66562fb ("libbpf: Add BPF object skeleton support") as the anonymous
      mmap()-ed "map initialization image" is remapped as a BPF map-backed
      mmap()-ed memory where for .rodata it's non-writable.
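
      Sketched, the write-phase tracking around the update/delete paths could then
      look as follows (helper names are illustrative):

        static void bpf_map_write_active_inc(struct bpf_map *map)
        {
                atomic64_inc(&map->writecnt);
        }

        static void bpf_map_write_active_dec(struct bpf_map *map)
        {
                atomic64_dec(&map->writecnt);
        }

        /* In the map update/delete syscall paths:
         *
         *     bpf_map_write_active_inc(map);
         *     err = ... copy key/value from user and perform the operation ...;
         *     bpf_map_write_active_dec(map);
         *
         * while map_freeze() fails with -EBUSY as long as the counter is non-zero.
         */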
      
      Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
      Reported-by: w1tcher.bupt@gmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      
      conflicts:
          kernel/bpf/syscall.c
      Signed-off-by: He Fengqing <hefengqing@huawei.com>
      Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  7. 15 Nov 2021, 3 commits
  8. 19 Oct 2021, 1 commit
  9. 15 Oct 2021, 1 commit
    • bpf: Track subprog poke descriptors correctly and fix use-after-free · de939748
      Committed by John Fastabend
      stable inclusion
      from stable-5.10.53
      commit a9f36bf3613c65cb587c70fac655c775d911409b
      bugzilla: 175574 https://gitee.com/openeuler/kernel/issues/I4DTUX
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a9f36bf3613c65cb587c70fac655c775d911409b
      
      --------------------------------
      
      commit f263a814 upstream.
      
      Subprograms call map_poke_track(), but there is no hook to call
      map_poke_untrack() on program release. On release, however, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we
      then end up accessing freed memory, triggering KASAN in
      prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the
      latter, which simplifies their management a bit. The aux->poke_tab struct,
      however, did not follow this logic. The initially proposed fix for this
      use-after-free bug further embedded poke data tracking into the subprogram
      with proper reference counting. However, Daniel and Alexei questioned why we
      were treating these objects as special; I agree, it's unnecessary. The fix
      here removes the per-subprogram poke table allocation and map tracking and
      instead simply points the aux->poke_tab pointer at the main program's poke
      table. This way, map tracking is simplified to the main program and we do not
      need to manage it per subprogram.
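
      A simplified sketch of the JIT-side walk with that ownership check (the field
      names follow the description above; real JIT code differs in detail):

        for (i = 0; i < prog->aux->size_poke_tab; i++) {
                struct bpf_jit_poke_descriptor *poke = &prog->aux->poke_tab[i];

                /* poke_tab is shared with the main program; only fix up entries
                 * owned by the program currently being JITed.
                 */
                if (poke->aux && poke->aux != prog->aux)
                        continue;

                /* ... apply the fixup for this entry ... */
        }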
      
      This also means that bpf_prog_free_deferred(), which unwinds the program
      reference counting and kfrees objects, needs to ensure that we don't try to
      double free the poke_tab when freeing the subprog structures. This is easily
      solved by NULL'ing the poke_tab pointer. The second detail is to ensure that
      per-subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure that points at the subprogram
      value, so JITs can easily check, while walking the poke_tab structure, whether
      the current entry belongs to the current program. The aux pointer is stable
      and therefore suitable for such a comparison. On the jit_subprogs() error
      path, we omit cleaning up the poke->aux field because these are only ever
      referenced from the JIT side; on error we will never make it to the JIT, so
      it's fine to leave them dangling. Removing these pointers would complicate the
      error path for no reason. However, we do need to untrack all poke descriptors
      from the main program as otherwise they could race with the freeing of JIT
      memory from the subprograms. Lastly, a748c697 ("bpf: propagate poke
      descriptors to subprograms") had an off-by-one on the subprogram instruction
      index range check, as it was testing 'insn_idx >= subprog_start &&
      insn_idx <= subprog_end'. However, subprog_end is the next subprogram's start
      instruction, so the upper bound must be exclusive.
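
      The corrected range check therefore uses an exclusive upper bound, roughly
      (sketch, with an illustrative body):

        if (insn_idx >= subprog_start && insn_idx < subprog_end)
                poke->aux = subprog_aux;   /* entry belongs to this subprogram */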
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  10. 03 Jul 2021, 1 commit
  11. 03 Jun 2021, 1 commit
    • bpf: Allow variable-offset stack access · 65b36570
      Committed by Andrei Matei
      stable inclusion
      from stable-5.10.33
      commit f3c4b01689d392373301e6e60d1b02c5b4020afc
      bugzilla: 51834
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 01f810ac ]
      
      Before this patch, variable-offset access to the stack was disallowed
      for regular instructions, but was allowed for "indirect" accesses (i.e.
      helpers). This patch removes the restriction, allowing reading and
      writing to the stack through stack pointers with variable offsets. This
      makes stack-allocated buffers more usable in programs and brings stack
      pointers closer to other types of pointers.
      
      The motivation is being able to use stack-allocated buffers for data
      manipulation. When the stack size limit is sufficient, allocating
      buffers on the stack is simpler than per-cpu arrays, or other
      alternatives.
      
      In unprivileged programs, variable-offset reads and writes are
      disallowed (they were already disallowed for the indirect access case)
      because the speculative execution checking code doesn't support them.
      Additionally, when writing through a variable-offset stack pointer, if
      any pointers are in the accessible range, there is a possibility of
      later leaking pointers because the write cannot be tracked precisely.
      
      Writes with variable offset mark the whole range as initialized, even
      though we don't know which stack slots are actually written. This is
      done in order to not reject future reads to these slots. Note that this
      doesn't affect writes done through helpers; as before, helpers need the
      whole stack range to be initialized to begin with.
      All the stack slots in the range are considered scalars after the write;
      variable-offset register spills are not tracked.
      
      For reads, all the stack slots in the variable range need to be
      initialized (but see above about what writes do), otherwise the read is
      rejected. All registers spilled in stack slots that might be read are
      marked as having been read; however, reads through such pointers don't
      do register filling; the target register will always be either a scalar
      or a constant zero.
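
      As a minimal example of what this enables, a (privileged) BPF program can now
      index a stack buffer with a bounded but non-constant offset (hypothetical
      program, for illustration only):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("raw_tp/sys_enter")
        int var_off_stack(void *ctx)
        {
                char buf[64] = {};                      /* whole range initialized */
                __u32 idx = bpf_get_prandom_u32() & 63; /* bounded, not constant   */

                buf[idx] = 1;          /* direct variable-offset stack write */
                return buf[idx];       /* direct variable-offset stack read  */
        }

        char _license[] SEC("license") = "GPL";
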
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  12. 22 Apr 2021, 1 commit
  13. 19 Apr 2021, 1 commit
  14. 13 Apr 2021, 1 commit
  15. 09 Apr 2021, 1 commit
  16. 12 Oct 2020, 1 commit
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Committed by Daniel Borkmann
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing in Cilium's load balancer,
      which uses map-in-map with an outer hash map and an inner array map holding
      the Maglev backend table for each service. It has been designed this way in
      order to reduce overall memory consumption, given that the outer hash map
      allows us to avoid preallocating a large, flat memory area for all services.
      Also, the number of service mappings is not always known a priori.
      
      The use case for dynamic inner array map entries is to further reduce memory
      overhead; for example, some services might just have a small number of
      backends while others could have a large number. Right now, the Maglev backend
      tables for small and large numbers of backends would need to have the same
      number of inner array map entries, which adds a lot of unneeded overhead.
      
      Dynamic inner array map entries can be realized by avoiding the inlined code
      generation for their lookup. The lookup will still be efficient since it will
      call into array_map_lookup_elem() directly and thus avoid a retpoline. The
      patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips inline
      code generation and relaxes the array_map_meta_equal() check to ignore both
      maps' max_entries. This still allows faster lookups for map-in-map when
      BPF_F_INNER_MAP is not specified and dynamic max_entries is hence not needed.
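
      For illustration, such a map-in-map could be declared as follows with libbpf's
      BTF-defined map syntax (names are made up for the sketch); inner maps created
      later can then use differing max_entries, since the meta-equal check ignores
      it when BPF_F_INNER_MAP is set:

        struct inner_array {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(map_flags, BPF_F_INNER_MAP); /* no inlined lookup, allow  */
                __uint(max_entries, 1);             /* differing max_entries     */
                __type(key, __u32);
                __type(value, __u64);
        } inner_tmpl SEC(".maps");

        struct {
                __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
                __uint(max_entries, 128);
                __type(key, __u32);
                __array(values, struct inner_array);
        } outer_arr_dyn SEC(".maps");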
      
      Example code generation where inner map is dynamic sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
  17. 03 Oct 2020, 2 commits
  18. 30 Sep 2020, 2 commits
  19. 29 Sep 2020, 3 commits
    • bpf: Add bpf_snprintf_btf helper · c4d0bfb4
      Committed by Alan Maguire
      A helper is added to support tracing kernel type information in BPF
      using the BPF Type Format (BTF).  Its signature is
      
      long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
      		      u32 btf_ptr_size, u64 flags);
      
      struct btf_ptr * specifies
      
      - a pointer to the data to be traced
      - the BTF id of the type of data pointed to
      - a flags field for future use; these flags are not to be
        confused with the BTF_F_* flags below, which control how
        the btf_ptr is displayed. The flags member of struct
        btf_ptr may be used to disambiguate types in kernel versus
        module BTF, etc.; the main distinction is that these flags
        relate to the type and the information needed to identify
        it, not to how it is displayed.
      
      For example, a BPF program with a struct sk_buff *skb
      could do the following:
      
      	static struct btf_ptr b = { };
      
      	b.ptr = skb;
      	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
      	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);
      
      Default output looks like this:
      
      (struct sk_buff){
       .transport_header = (__u16)65535,
       .mac_header = (__u16)65535,
       .end = (sk_buff_data_t)192,
       .head = (unsigned char *)0x000000007524fd8b,
       .data = (unsigned char *)0x000000007524fd8b,
       .truesize = (unsigned int)768,
       .users = (refcount_t){
        .refs = (atomic_t){
         .counter = (int)1,
        },
       },
      }
      
      Flags modifying display are as follows:
      
      - BTF_F_COMPACT:	no formatting around type information
      - BTF_F_NONAME:		no struct/union member names/types
      - BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
      			equivalent to %px.
      - BTF_F_ZERO:		show zero-valued struct/union members;
      			they are not displayed by default
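
      For example, a compact dump without member names could be requested by
      combining flags in the snippet above (sketch):

      	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b),
      			 BTF_F_COMPACT | BTF_F_NONAME);
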
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
    • 76654e67
    • bpf: verifier: refactor check_attach_btf_id() · f7b12b6f
      Committed by Toke Høiland-Jørgensen
      The check_attach_btf_id() function really does three things:
      
      1. It performs a bunch of checks on the program to ensure that the
         attachment is valid.
      
      2. It stores a bunch of state about the attachment being requested in
         the verifier environment and struct bpf_prog objects.
      
      3. It allocates a trampoline for the attachment.
      
      This patch splits out (1.) and (3.) into separate functions which will
      perform the checks, but return the computed values instead of directly
      modifying the environment. This is done in preparation for reusing the
      checks when the actual attachment is happening, which will allow tracing
      programs to have multiple (compatible) attachments.
      
      This also fixes a bug where a bunch of checks were skipped if a trampoline
      already existed for the tracing target.
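
      Conceptually, the resulting flow looks roughly like this (function and type
      names here are illustrative, not the exact ones introduced by the patch):

        static int check_attach_btf_id(struct bpf_verifier_env *env)
        {
                struct attach_target_info tgt = {}; /* computed, not yet applied */
                int err;

                /* 1. pure checks: validate the requested attachment, fill 'tgt' */
                err = check_attach_target(env, env->prog, &tgt);
                if (err)
                        return err;

                /* 2. commit the computed state to env/prog only after checks pass */
                apply_attach_target(env, &tgt);

                /* 3. allocate a trampoline for the attachment */
                return get_attach_trampoline(env->prog, &tgt);
        }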
      
      Fixes: 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
      Fixes: 1e6c62a8 ("bpf: Introduce sleepable BPF programs")
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>