1. 24 Aug 2022, 1 commit
  2. 27 Apr 2022, 1 commit
  3. 08 Mar 2022, 8 commits
  4. 14 Jan 2022, 1 commit
  5. 07 Jan 2022, 1 commit
  6. 30 Nov 2021, 1 commit
    • bpf: Fix toctou on read-only map's constant scalar tracking · 84c51e2d
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v5.16-rc1
      commit 353050be
      category: bugfix
      bugzilla: 185802 https://gitee.com/openeuler/kernel/issues/I4DDEL
      CVE: CVE-2021-4001
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=353050be4c19e102178ccc05988101887c25ae53
      
      --------------------------------
      
      Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is
      checking whether maps are read-only both from BPF program side and user space
      side, and then, given their content is constant, reading out their data via
      map->ops->map_direct_value_addr() which is then subsequently used as known
      scalar value for the register, that is, it is marked as __mark_reg_known()
      with the read value at verification time. Before a23740ec, the register
      content was marked as an unknown scalar so the verifier could not make any
      assumptions about the map content.
      
      The current implementation however is prone to a TOCTOU race, meaning, the
      value read as known scalar for the register is not guaranteed to be exactly
      the same at a later point when the program is executed, and as such, the
      prior made assumptions of the verifier with regards to the program will be
      invalid which can cause issues such as OOB access, etc.
      
      While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
      specified at map creation time, the map->frozen property is initially set to
      false for the map given the map value needs to be populated, e.g. for global
      data sections. Once complete, the loader "freezes" the map from user space
      such that no subsequent updates/deletes are possible anymore. For the rest
      of the lifetime of the map, this freeze one-time trigger cannot be undone
      anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
      cmd calls which would update/delete map entries will be rejected with -EPERM
      since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
      means that pending update/delete map entries must still complete before this
      guarantee is given. This corner case is not an issue for loaders since they
      create and prepare such program private map in successive steps.
      
      However, a malicious user is able to trigger this TOCTOU race in two different
      ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
      used to expand the competition interval, so that map_update_elem() can modify
      the contents of the map after map_freeze() and bpf_prog_load() were executed.
      This works, because userfaultfd halts the parallel thread which triggered a
      map_update_elem() at the time where we copy key/value from the user buffer and
      this already passed the FMODE_CAN_WRITE capability test given at that time the
      map was not "frozen". Then, the main thread performs the map_freeze() and
      bpf_prog_load(), and once that had completed successfully, the other thread
      is woken up to complete the pending map_update_elem() which then changes the
      map content. For ii) the idea of the batched update is similar, meaning, when
      there are a large number of updates to be processed, it can increase the
      competition interval between the two. It is therefore possible in practice to
      modify the contents of the map after executing map_freeze() and bpf_prog_load().
      
      One way to fix both i) and ii) at the same time is to expand the use of the
      map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
      support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
      Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
      the rationale to make a writable mmap()'ing of a map mutually exclusive with
      read-only freezing. The counter indicates writable mmap() mappings and then
      prevents/fails the freeze operation. Its semantics can be expanded beyond
      just mmap() by generally indicating ongoing write phases. This would essentially
      span any parallel regular and batched flavor of update/delete operation and
      then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
      the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
      last pending writes have completed via the bpf_map_write_active() test. Only
      once map->frozen is set and bpf_map_write_active() indicates a map->writecnt
      of 0 are we really guaranteed to use the map's data as known constants.
      For map->frozen being set and pending writes in process of still being completed
      we fall back to marking that register as unknown scalar so we don't end up
      making assumptions about it. With this, both TOCTOU reproducers from i) and
      ii) are fixed.
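
      A sketch of the verifier-side check described above (the shape follows the
      commit text; the exact backported code may differ slightly):

        static bool bpf_map_write_active(const struct bpf_map *map)
        {
                return atomic64_read(&map->writecnt) != 0;
        }

        static bool bpf_map_is_rdonly(const struct bpf_map *map)
        {
                /* Only treat map values as known constants when the map is
                 * program-side read-only, user space froze it, and no regular
                 * or batched update/delete is still in flight.
                 */
                return (map->map_flags & BPF_F_RDONLY_PROG) &&
                       map->frozen &&
                       !bpf_map_write_active(map);
        }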
      
      Note that the map->writecnt has been converted into an atomic64 in the fix in
      order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
      map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
      the freeze_mutex over entire map update/delete operations in syscall side
      would not be possible due to then causing everything to be serialized.
      Similarly, something like synchronize_rcu() after setting map->frozen to wait
      for update/deletes to complete is not possible either since it would also
      have to span the user copy which can sleep. On the libbpf side, this won't
      break d66562fb ("libbpf: Add BPF object skeleton support") as the
      anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
      mmap()-ed memory where for .rodata it's non-writable.
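
      On the syscall side, a sketch of the write-phase accounting described above
      (bpf_map_write_active() is named in the text; the _inc()/_dec() helper names
      and call sites are assumptions for illustration):

        /* Bump/drop the write counter around each regular and batched
         * update/delete flavor, before/after the user-space copies.
         */
        static void bpf_map_write_active_inc(struct bpf_map *map)
        {
                atomic64_inc(&map->writecnt);
        }

        static void bpf_map_write_active_dec(struct bpf_map *map)
        {
                atomic64_dec(&map->writecnt);
        }

        /* In map_freeze(), under freeze_mutex: refuse while writes are active. */
        if (bpf_map_write_active(map)) {
                err = -EBUSY;
                goto err_put;
        }

      With that, a freeze racing against an in-flight update/delete now fails with
      -EBUSY instead of silently succeeding.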
      
      Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
      Reported-by: w1tcher.bupt@gmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      
      conflicts:
          kernel/bpf/syscall.c
      Signed-off-by: He Fengqing <hefengqing@huawei.com>
      Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  7. 15 Nov 2021, 3 commits
  8. 19 Oct 2021, 1 commit
  9. 15 Oct 2021, 1 commit
    • bpf: Track subprog poke descriptors correctly and fix use-after-free · de939748
      Committed by John Fastabend
      stable inclusion
      from stable-5.10.53
      commit a9f36bf3613c65cb587c70fac655c775d911409b
      bugzilla: 175574 https://gitee.com/openeuler/kernel/issues/I4DTUX
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a9f36bf3613c65cb587c70fac655c775d911409b
      
      --------------------------------
      
      commit f263a814 upstream.
      
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were treating
      these objects specially; I agree, it's unnecessary. The fix here removes the per
      subprogram poke table allocation and map tracking and instead simply points
      the aux->poke_tab pointer at the main program's poke table. This way, map
      tracking is simplified to the main program and we do not need to manage it
      per subprogram.
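
      A sketch of that pointer sharing (jit_subprogs()-time, heavily simplified;
      surrounding context omitted):

        /* Each subprogram simply aliases the main program's poke table
         * instead of allocating and map-tracking its own copy.
         */
        func[i]->aux->poke_tab = prog->aux->poke_tab;
        func[i]->aux->size_poke_tab = prog->aux->size_poke_tab;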
      
      This also means that bpf_prog_free_deferred(), which unwinds the program reference
      counting and kfrees objects, needs to ensure that we don't try to double free
      the poke_tab when free'ing the subprog structures. This is easily solved by
      NULL'ing the poke_tab pointer. The second detail is to ensure that per
      subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure to point at the subprogram value
      so JITs can easily check while walking the poke_tab structure if the current
      entry belongs to the current program. The aux pointer is stable and therefore
      suitable for such comparison. On the jit_subprogs() error path, we omit
      cleaning up the poke->aux field because these are only ever referenced from
      the JIT side, but on error we will never make it to the JIT, so it's fine to
      leave them dangling. Removing these pointers would complicate the error path
      for no reason. However, we do need to untrack all poke descriptors from the
      main program as otherwise they could race with the freeing of JIT memory from
      the subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
      However, subprog_end is the next subprogram's start instruction.
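
      A sketch of the ownership test and the corrected range check (illustrative;
      it follows the description above rather than quoting the patch):

        /* JIT-side fixup: only touch poke entries owned by this program. */
        for (i = 0; i < prog->aux->size_poke_tab; i++) {
                poke = &prog->aux->poke_tab[i];

                if (poke->aux && poke->aux != prog->aux)
                        continue;       /* belongs to a different (sub)program */

                /* ... patch the tailcall target for this entry ... */
        }

        /* The off-by-one: subprog_end is the next subprogram's start, so the
         * containment test must be exclusive at the upper end:
         *     insn_idx >= subprog_start && insn_idx < subprog_end
         */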
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  10. 03 Jul 2021, 1 commit
  11. 03 Jun 2021, 1 commit
    • bpf: Allow variable-offset stack access · 65b36570
      Committed by Andrei Matei
      stable inclusion
      from stable-5.10.33
      commit f3c4b01689d392373301e6e60d1b02c5b4020afc
      bugzilla: 51834
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 01f810ac ]
      
      Before this patch, variable offset access to the stack was disallowed
      for regular instructions, but was allowed for "indirect" accesses (i.e.
      helpers). This patch removes the restriction, allowing reading and
      writing to the stack through stack pointers with variable offsets. This
      makes stack-allocated buffers more usable in programs, and brings stack
      pointers closer to other types of pointers.
      
      The motivation is being able to use stack-allocated buffers for data
      manipulation. When the stack size limit is sufficient, allocating
      buffers on the stack is simpler than per-cpu arrays, or other
      alternatives.
      
      In unprivileged programs, variable-offset reads and writes are
      disallowed (they were already disallowed for the indirect access case)
      because the speculative execution checking code doesn't support them.
      Additionally, when writing through a variable-offset stack pointer, if
      any pointers are in the accessible range, there's a possibility of later
      leaking pointers because the write cannot be tracked precisely.
      
      Writes with variable offset mark the whole range as initialized, even
      though we don't know which stack slots are actually written. This is in
      order to not reject future reads to these slots. Note that this doesn't
      affect writes done through helpers; like before, helpers need the whole
      stack range to be initialized to begin with.
      All the stack slots in range are considered scalars after the write;
      variable-offset register spills are not tracked.
      
      For reads, all the stack slots in the variable range need to be
      initialized (but see above about what writes do), otherwise the read is
      rejected. All registers spilled in stack slots that might be read are
      marked as having been read; however, reads through such pointers don't do
      register filling; the target register will always be either a scalar or
      a constant zero.
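
      A minimal sketch of the kind of program this enables (assumed example, not
      taken from the patch or its selftests):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("socket")
        int var_off_stack(struct __sk_buff *skb)
        {
                char buf[16] = {};              /* whole range initialized up front */
                __u32 idx = skb->len & 0xf;     /* bounded, but not a known constant */

                /* Variable-offset stack read: accepted after this change because
                 * every slot that might be accessed has been initialized.
                 */
                return buf[idx];
        }

        char _license[] SEC("license") = "GPL";
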
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  12. 22 Apr 2021, 1 commit
  13. 19 Apr 2021, 1 commit
  14. 13 Apr 2021, 1 commit
  15. 09 Apr 2021, 1 commit
  16. 12 Oct 2020, 1 commit
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Committed by Daniel Borkmann
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing into Cilium's load balancer
      which uses map-in-map with an outer map being hash and inner being array holding
      the Maglev backend table for each service. It has been designed this way to
      reduce overall memory consumption, given the outer hash map makes it possible
      to avoid preallocating a large, flat memory area for all services. Also, the
      number of service mappings is not always known a-priori.
      
      The use case for dynamic inner array map entries is to further reduce memory
      overhead; for example, some services might just have a small number of backends
      while others could have a large number. Right now the Maglev backend table
      for small and large number of backends would need to have the same inner array
      map entries which adds a lot of unneeded overhead.
      
      Dynamic inner array map entries can be realized by avoiding the inlined code
      generation for their lookup. The lookup will still be efficient since it will
      be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
      The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
      inline code generation and relaxes the array_map_meta_equal() check to ignore
      both maps' max_entries. This still allows faster lookups for map-in-map when
      BPF_F_INNER_MAP is not specified and dynamic max_entries is therefore not needed.
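
      As a usage sketch (selftest-style BTF map definitions; assumed and abbreviated,
      not quoted from the patch):

        /* Outer array-of-maps whose slots may hold inner arrays of differing
         * max_entries, because the inner template sets BPF_F_INNER_MAP and the
         * inlined lookup is skipped.
         */
        struct inner_map {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(map_flags, BPF_F_INNER_MAP);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, int);
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
                __uint(max_entries, 3);
                __uint(key_size, sizeof(int));
                __uint(value_size, sizeof(int));
                __array(values, struct inner_map);
        } outer_arr_dyn SEC(".maps");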
      
      Example code generation where inner map is dynamic sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
  17. 03 Oct 2020, 2 commits
  18. 30 Sep 2020, 2 commits
  19. 29 Sep 2020, 5 commits
  20. 26 Sep 2020, 2 commits
    • bpf: Add comment to document BTF type PTR_TO_BTF_ID_OR_NULL · ba5f4cfe
      Committed by John Fastabend
      The meaning of PTR_TO_BTF_ID_OR_NULL differs slightly from other types
      denoted with the *_OR_NULL type. For example the types PTR_TO_SOCKET
      and PTR_TO_SOCKET_OR_NULL can be used for branch analysis because the
      type PTR_TO_SOCKET is guaranteed to _not_ have a null value.
      
      In contrast, PTR_TO_BTF_ID and PTR_TO_BTF_ID_OR_NULL have slightly
      different meanings. A PTR_TO_BTF_ID may be a NULL pointer, but it is
      safe to read through this pointer in the program context because the
      program context will handle any faults. The fallout is that for
      PTR_TO_BTF_ID the verifier can assume reads are safe, but cannot
      use the type in branch analysis. Additionally, authors need to be
      extra careful when passing PTR_TO_BTF_ID into helpers. In general
      helpers consuming type PTR_TO_BTF_ID will need to assume it may
      be null.
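
      As an illustration (assumed example, not part of this patch): a tracing
      program may dereference such a pointer directly, while any helper it is
      passed to still has to tolerate NULL:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("tp_btf/sched_switch")
        int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
                     struct task_struct *next)
        {
                /* prev/next are PTR_TO_BTF_ID: the verifier accepts the read
                 * without a NULL check; a faulting load just yields zero.
                 */
                bpf_printk("switching from pid %d", prev->pid);
                return 0;
        }

        char _license[] SEC("license") = "GPL";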
      
      Since the above is not obvious to readers without this background
      knowledge, let's add a comment in the type definition.
      
      Editorial comment: as networking and tracing programs get closer
      and more tightly merged we may need to consider a new type that we
      can ensure is non-null for branch analysis and also passing into
      helpers.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Lorenz Bauer <lmb@cloudflare.com>
    • bpf: Enable bpf_skc_to_* sock casting helper to networking prog type · 1df8f55a
      Committed by Martin KaFai Lau
      There is a constant need to add more fields into the bpf_tcp_sock
      for the bpf programs running at tc, sock_ops...etc.
      
      A current workaround could be to use bpf_probe_read_kernel().  However,
      other than making another helper call for reading each field and missing
      CO-RE, it is also not as intuitive to use as directly reading
      "tp->lsndtime" for example.  While already having perfmon cap to do
      bpf_probe_read_kernel(), it will be much easier if the bpf prog can
      directly read from the tcp_sock.
      
      This patch tries to do that by using the existing casting-helpers
      bpf_skc_to_*() whose func_proto returns a btf_id.  For example, the
      func_proto of bpf_skc_to_tcp_sock returns the btf_id of the
      kernel "struct tcp_sock".
      
      These helpers are also added to is_ptr_cast_function().
      It ensures the returned reg (BPF_REG_0) will also carry the ref_obj_id.
      That will keep the ref-tracking working properly.
      
      The bpf_skc_to_* helpers are made available to most of the bpf prog
      types in filter.c. The bpf_skc_to_* helpers will be limited by
      perfmon cap.
      
      This patch adds an ARG_PTR_TO_BTF_ID_SOCK_COMMON.  The helper accepting
      this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON])
      or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON).  The bpf_skc_to_*()
      helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that
      they will accept a pointer obtained from skb->sk.
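
      A sketch of the resulting usage from a networking program (assumed example;
      the section and field choices are illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        SEC("classifier")
        int read_lsndtime(struct __sk_buff *skb)
        {
                struct bpf_sock *sk = skb->sk;  /* PTR_TO_SOCK_COMMON_OR_NULL */
                struct tcp_sock *tp;

                if (!sk)
                        return 0;
                /* Accepted via ARG_PTR_TO_BTF_ID_SOCK_COMMON; may return NULL. */
                tp = bpf_skc_to_tcp_sock(sk);
                if (!tp)
                        return 0;
                bpf_printk("lsndtime %u", tp->lsndtime);
                return 0;
        }

        char _license[] SEC("license") = "GPL";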
      
      Instead of specifying both arg_type and arg_btf_id in the same func_proto
      which is what the current ARG_PTR_TO_BTF_ID does, the arg_btf_id of
      the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in the
      compatible_reg_types[] in verifier.c.  The reason is the arg_btf_id is
      always the same.  Discussion in this thread:
      https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/
      
      The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is
      expecting a PTR_TO_BTF_ID which could be NULL.  This is the same
      behavior as the existing helper taking ARG_PTR_TO_BTF_ID.
      
      The _SOCK_COMMON part means the helper is also expecting the legacy
      SOCK_COMMON pointer.
      
      By excluding the _OR_NULL part, the bpf prog cannot call helper
      with a literal NULL which doesn't make sense in most cases.
      e.g. bpf_skc_to_tcp_sock(NULL) will be rejected.  All PTR_TO_*_OR_NULL
      reg has to do a NULL check first before passing into the helper or else
      the bpf prog will be rejected.  This behavior is nothing new and
      consistent with the current expectation during bpf-prog-load.
      
      [ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace
        ARG_PTR_TO_SOCK* of other existing helpers later such that
        those existing helpers can take the PTR_TO_BTF_ID returned by
        the bpf_skc_to_*() helpers.
      
        The only special case is bpf_sk_lookup_assign() which can accept a
        literal NULL ptr.  It has to be handled specially in another follow
        up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL
        to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ]
      
      [ When converting the older helpers that take ARG_PTR_TO_SOCK* in
        the later patch, if the kernel does not support BTF,
        ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON
        because no reg->type could have PTR_TO_BTF_ID in this case.
      
        It is not a concern for the newer-btf-only helper like the bpf_skc_to_*()
        here though because these helpers must require BTF vmlinux to begin
        with. ]
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com
  21. 22 Sep 2020, 3 commits
  22. 18 Sep 2020, 1 commit
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Committed by Maciej Fijalkowski
      This commit serves two things:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the layout of the stack is changed, tail call counter handling can no
      longer rely on popping it to rbx, as was done for the constant prologue case
      with a later overwrite of rbx with the actual value pushed to the stack.
      Therefore, let's use one of the registers (%rcx) that is considered to be
      volatile/caller-saved and pop the value of the tail call counter into it in
      the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce new poke target, 'tailcall_bypass' to poke descriptor that is
      dedicated for skipping the register pops and stack unwind that are
      generated right before the actual jump to target program.
      For case when the target program is not present, BPF program will skip
      the pop instructions and nop5 dedicated for jmpq $target. An example of
      such state when only R6 of callee saved registers is used by program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When target program is inserted, the jump that was there to skip
      pops/nop5 will become the nop5, so CPU will go over pops and do the
      actual tailcall.
      
      One might ask why there simply cannot be pushes after the nop5.
      In the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
      and jump target is not present. ctx is in %rbx register and BPF
      subprogram that we will call into on ffffffffc037034c is relying on it,
      e.g. it will pick ctx from there. Such code layout is therefore broken
      as we would overwrite the content of %rbx with the value that was pushed
      in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of
      tailcall target. In case when target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
         grace period so that other programs will finish their execution and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to run into an endless loop due to clearing of the
      tailcall counter if, for example, we used a tailcall3-like program from
      the BPF selftests that is subprogram-based, meaning the tailcall
      would be present within a BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
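
      A sketch of that pattern (selftest-style; names and the section are assumed):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
                __uint(max_entries, 1);
                __uint(key_size, sizeof(__u32));
                __uint(value_size, sizeof(__u32));
        } jmp_table SEC(".maps");

        static __noinline int subprog_tail(struct __sk_buff *skb)
        {
                /* Tail call issued from inside a bpf2bpf subprogram; without
                 * propagating the counter this would reset it and loop forever.
                 */
                bpf_tail_call(skb, &jmp_table, 0);
                return 1;
        }

        SEC("classifier")
        int entry(struct __sk_buff *skb)
        {
                return subprog_tail(skb);
        }

        char _license[] SEC("license") = "GPL";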
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for a tailcall presence throughout the whole chain. If we see a
      single tail call, then each node in this call chain needs to be marked as a
      subprog that can reach the tailcall (see the sketch after this list). We
      would later feed the JIT with this info and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
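
      A standalone illustration of that marking pass (plain C, not kernel code;
      the data layout is invented for the example):

        #include <stdbool.h>

        #define MAX_SUBPROGS 16

        struct subprog_info {
                bool has_tail_call;             /* this subprog itself tail-calls */
                bool tail_call_reachable;       /* result of the marking pass     */
                int  callees[MAX_SUBPROGS];     /* bpf2bpf callees (indices)      */
                int  nr_callees;
        };

        /* Mark every subprog from which a tail call can be reached through the
         * bpf2bpf call chain; BPF call graphs are acyclic, so plain recursion
         * terminates.
         */
        static bool mark_tail_call_reachable(struct subprog_info *sp, int idx)
        {
                bool reachable = sp[idx].has_tail_call;

                for (int i = 0; i < sp[idx].nr_callees; i++)
                        if (mark_tail_call_reachable(sp, sp[idx].callees[i]))
                                reachable = true;

                sp[idx].tail_call_reachable = reachable;
                return reachable;
        }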
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>