1. 03 May 2021, 1 commit
    • bpf: Fix leakage of uninitialized bpf stack under speculation · 801c6058
      Committed by Daniel Borkmann
      The current implemented mechanisms to mitigate data disclosure under
      speculation mainly address stack and map value oob access from the
      speculative domain. However, Piotr discovered that uninitialized BPF
      stack is not protected yet, and thus old data from the kernel stack,
      potentially including addresses of kernel structures, could still be
      extracted from that 512 bytes large window. The BPF stack is special
      compared to map values since it's not zero initialized for every
      program invocation, whereas map values /are/ zero initialized upon
      their initial allocation and thus cannot leak any prior data in either
      domain. In the non-speculative domain, the verifier ensures that every
      stack slot read must have a prior stack slot write by the BPF program
      to avoid such data leaking issue.
      
      However, this is not enough: for example, when the pointer arithmetic
      operation moves the stack pointer from the last valid stack offset to
      the first valid offset, the sanitation logic allows for any intermediate
      offsets during speculative execution, which could then be used to
      extract any restricted stack content via side-channel.
      
      Given that the use of unknown but bounded scalars is generally forbidden
      for unprivileged stack pointer arithmetic, we can simply turn the
      register-based arithmetic operation into an immediate-based arithmetic
      operation without the need for masking. This also has the benefit of
      reducing the number of instructions needed for the operation. After the
      work in 7fedb63a ("bpf: Tighten speculative pointer arithmetic mask"),
      the aux->alu_limit already holds the final immediate value for the
      offset register with the known scalar. Thus, a simple mov of that
      immediate into the AX register, with AX used as the source of the
      original instruction, is now sufficient in this case.
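      A rough sketch of the rewrite idea described above (illustrative only,
      not the literal kernel diff; the helper name is made up):

      /* Hypothetical illustration: emit a mov of the precomputed limit into
       * the auxiliary AX register and let AX feed the original ALU insn,
       * instead of emitting the register-masking sequence.
       */
      #include <linux/filter.h>

      static int sanitize_alu_imm_sketch(struct bpf_insn *insn,
                                         struct bpf_insn *patch, u32 alu_limit)
      {
              struct bpf_insn *start = patch;

              *patch++ = BPF_MOV32_IMM(BPF_REG_AX, alu_limit); /* AX = known limit */
              insn->src_reg = BPF_REG_AX;                      /* rX (op)= AX      */
              *patch++ = *insn;                                /* re-emit ALU insn */

              return patch - start;                            /* patched insns    */
      }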
      Reported-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      801c6058
  2. 14 Apr 2021, 1 commit
  3. 27 Feb 2021, 1 commit
    • bpf: Add bpf_for_each_map_elem() helper · 69c087ba
      Committed by Yonghong Song
      The bpf_for_each_map_elem() helper is introduced which
      iterates all map elements with a callback function. The
      helper signature looks like
        long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
      and for each map element, callback_fn will be called. For a hashmap, for
      example, the callback signature may look like
        long callback_fn(map, key, val, callback_ctx)
      
      There are two known use cases for this. One is from upstream ([1]) where
      a for_each_map_elem helper may help implement a timeout mechanism
      in a more generic way. Another is from our internal discussion
      for a firewall use case where a map contains all the rules. The packet
      data can be compared against these rules to decide whether to allow or
      deny the packet.
      
      For array maps, users can already use a bounded loop to traverse
      elements. Using this helper avoids the need for such a loop. For other
      types of maps (e.g., hash maps) where a bounded loop is hard or
      impossible to use, this helper provides a convenient way to operate on
      all elements.
      
      For callback_fn, besides the map and the map element, a callback_ctx,
      allocated on the caller's stack, is also passed to the callback
      function. This callback_ctx argument can provide additional input and
      allows writing to the caller's stack for output.
      
      If the callback_fn returns 0, the helper will move on to the next
      element if available. If the callback_fn returns 1, the helper will
      stop iterating and return to the bpf program. Other return values are
      not used for now.
      
      Currently, this helper is only available with the jit. It is possible
      to make it work with the interpreter with some effort, but that is left
      as future work.
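      A minimal BPF C usage sketch (illustrative only; the map, program and
      callback names below are made up and not part of this patch):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct bpf_map;                         /* opaque, only passed through */

      struct {
              __uint(type, BPF_MAP_TYPE_HASH);
              __uint(max_entries, 128);
              __type(key, __u32);
              __type(value, __u64);
      } rules SEC(".maps");

      struct cb_ctx {
              __u64 threshold;
              __u64 matches;                  /* output via the caller's stack */
      };

      /* Called once per element; return 0 to continue, 1 to stop iterating. */
      static long count_above(struct bpf_map *map, __u32 *key, __u64 *val,
                              struct cb_ctx *ctx)
      {
              if (*val > ctx->threshold)
                      ctx->matches++;
              return 0;
      }

      SEC("tc")
      int count_matches(struct __sk_buff *skb)
      {
              struct cb_ctx ctx = { .threshold = 100, .matches = 0 };

              bpf_for_each_map_elem(&rules, count_above, &ctx, 0);
              return ctx.matches ? 1 : 0;
      }

      char LICENSE[] SEC("license") = "GPL";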
      
      [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210226204925.3884923-1-yhs@fb.com
      69c087ba
  4. 13 Feb 2021, 1 commit
    • bpf: Support pointers in global func args · e5069b9c
      Committed by Dmitrii Banshchikov
      Add the ability to pass a pointer to a type with a known size as an
      argument of a global function. Such pointers may be used to overcome
      the limit on the maximum number of arguments, avoid expensive and
      tricky workarounds, and have multiple output arguments.
      
      A referenced type may contain pointers but indirect access through them
      isn't supported.
      
      The implementation consists of two parts.  If a global function has an
      argument that is a pointer to a type with known size then:
      
        1) In btf_check_func_arg_match(): check that the corresponding
      register points to NULL or to a valid memory region that is large enough
      to contain the expected argument's type.
      
        2) In btf_prepare_func_args(): set the corresponding register type to
      PTR_TO_MEM_OR_NULL and its size to the size of the expected type.
      
      Only global functions are supported because allowing pointers for
      static functions might break validation. Consider the following
      scenario. A static function has a pointer argument. A caller passes a
      pointer to its stack memory. Because the callee can change the
      referenced memory, the verifier can no longer assume any particular
      slot type for the caller's stack memory, hence the slot type is changed
      to SLOT_MISC. If there is an operation that relies on a slot type other
      than SLOT_MISC, the verifier won't be able to infer the safety of the
      operation.
      
      When the verifier sees a static function that has a pointer argument
      other than PTR_TO_CTX, it skips the argument check and continues with
      "inline" validation, where more information is available. The operation
      that relies on the particular slot type then succeeds.
      
      Because global functions were previously not allowed to have pointer
      arguments other than PTR_TO_CTX, it's not possible to break existing,
      valid code.
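      An illustrative BPF C sketch of the new capability (names are made up;
      the NULL check is what the PTR_TO_MEM_OR_NULL register type requires):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct stats {
              __u64 packets;
              __u64 bytes;
      };

      /* Global (non-static) subprogram taking a pointer to a known-size type. */
      __noinline int update_stats(struct stats *s, __u32 len)
      {
              if (!s)                 /* may be NULL from the verifier's view */
                      return 0;
              s->packets += 1;
              s->bytes += len;
              return 1;
      }

      SEC("tc")
      int count_pkt(struct __sk_buff *skb)
      {
              struct stats s = {};

              return update_stats(&s, skb->len);
      }

      char LICENSE[] SEC("license") = "GPL";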
      Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210212205642.620788-4-me@ubique.spb.ru
      e5069b9c
  5. 11 Feb 2021, 1 commit
    • bpf: Allow variable-offset stack access · 01f810ac
      Committed by Andrei Matei
      Before this patch, variable-offset access to the stack was disallowed
      for regular instructions, but was allowed for "indirect" accesses (i.e.
      helpers). This patch removes the restriction, allowing reading from and
      writing to the stack through stack pointers with variable offsets. This
      makes stack-allocated buffers more usable in programs, and brings stack
      pointers closer to other types of pointers.
      
      The motivation is being able to use stack-allocated buffers for data
      manipulation. When the stack size limit is sufficient, allocating
      buffers on the stack is simpler than per-cpu arrays, or other
      alternatives.
      
      In unprivileged programs, variable-offset reads and writes are
      disallowed (they were already disallowed for the indirect access case)
      because the speculative execution checking code doesn't support them.
      Additionally, when writing through a variable-offset stack pointer, if
      any pointers are in the accessible range, there's a possibility of
      later leaking pointers because the write cannot be tracked precisely.
      
      Writes with variable offset mark the whole range as initialized, even
      though we don't know which stack slots are actually written. This is in
      order not to reject future reads of these slots. Note that this doesn't
      affect writes done through helpers; as before, helpers need the whole
      stack range to be initialized to begin with.
      All the stack slots in range are considered scalars after the write;
      variable-offset register spills are not tracked.
      
      For reads, all the stack slots in the variable range need to be
      initialized (but see above about what writes do), otherwise the read is
      rejected. All registers spilled in stack slots that might be read are
      marked as having been read; however, reads through such pointers don't
      do register filling; the target register will always be either a scalar
      or a constant zero.
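      A small illustrative sketch of the now-permitted pattern (privileged
      programs only, per the note above; names are made up):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("tc")
      int var_off_stack(struct __sk_buff *skb)
      {
              char buf[16] = {};              /* stack buffer, fully initialized */
              __u32 idx = skb->len & 0xf;     /* variable offset bounded to [0, 15] */

              buf[idx] = 1;                   /* direct variable-offset stack write */
              return buf[skb->mark & 0xf];    /* direct variable-offset stack read  */
      }

      char LICENSE[] SEC("license") = "GPL";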
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
      01f810ac
  6. 13 Jan 2021, 1 commit
    • bpf: Support BPF ksym variables in kernel modules · 541c3bad
      Committed by Andrii Nakryiko
      Add support for directly accessing kernel module variables from BPF programs
      using special ldimm64 instructions. This functionality builds upon vmlinux
      ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
      specifying kernel module BTF's FD in insn[1].imm field.
      
      During BPF program load time, the verifier will resolve the FD to a BTF
      object and will take a reference on the BTF object itself and, for
      module BTFs, on the corresponding module as well, to make sure it won't
      be unloaded from under a running BPF program. The mechanism used is
      similar to how bpf_prog keeps track of used bpf_maps.
      
      One interesting change is also in how a per-CPU variable is determined.
      The logic is to find the .data..percpu data section in the provided
      BTF, but vmlinux and each module have their own .data..percpu entries
      in BTF. So for the module case, the search for the DATASEC record needs
      to look only at the module's added BTF types. This is implemented with
      a custom search function.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/bpf/20210112075520.4103414-6-andrii@kernel.org
      541c3bad
  7. 04 Dec 2020, 1 commit
    • bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier · 22dc4a0f
      Committed by Andrii Nakryiko
      Remove a pervasive assumption throughout the BPF verifier of vmlinux
      BTF. Instead, wherever BTF type IDs are involved, also track the
      instance of struct btf that goes along with the type ID. This allows
      gradually adding support for kernel module BTFs and using/tracking
      module types across BPF helper calls and registers.
      
      This patch also renames the btf_id() function to btf_obj_id() to
      minimize the naming clash between using btf_id to denote a BTF *type*
      ID and a BTF *object*'s ID.
      
      Also, although btf_vmlinux can't get destructed and thus doesn't need
      refcounting, module BTFs do need it, so apply BTF refcounting
      universally when a BPF program is using BTF-powered attachment (tp_btf,
      fentry/fexit, etc). This makes for simpler cleanup code.
      
      Now that a BTF type ID is not enough to uniquely identify a BTF type,
      extend the BPF trampoline key to include the BTF object ID. To
      differentiate it from a target program's BPF ID, set the 31st bit of
      the type ID. BTF type IDs (at least currently) are not allowed to take
      the full 32 bits, so there is no danger of confusing that bit with a
      valid BTF type ID.
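      A sketch of the key layout implied by that description (assumed
      encoding, for illustration only):

      #include <linux/types.h>

      /* Upper 32 bits: BTF object ID; bit 31 of the lower half marks "BTF
       * type ID" (as opposed to a target prog ID); remaining bits: type ID.
       */
      static inline u64 trampoline_key_sketch(u32 btf_obj_id, u32 btf_type_id)
      {
              return ((u64)btf_obj_id << 32) | 0x80000000 | btf_type_id;
      }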
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
      22dc4a0f
  8. 13 Nov 2020, 1 commit
  9. 03 Oct 2020, 1 commit
    • bpf: Introduce pseudo_btf_id · 4976b718
      Committed by Hao Luo
      Pseudo_btf_id is a type of ld_imm insn that associates a btf_id with a
      ksym so that further dereferences of the ksym can use the BTF info to
      validate accesses. Internally, when seeing a pseudo_btf_id ld insn, the
      verifier reads the btf_id stored in insn[0]'s imm field and marks the
      dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR kind, which is
      encoded in btf_vmlinux by pahole. If the VAR is not of a struct type,
      the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID and
      the mem_size is resolved to the size of the VAR's type.
      
      From the VAR btf_id, the verifier can also read the address of the
      ksym's corresponding kernel var from kallsyms and use that to fill
      dst_reg.
      
      Therefore, the proper functionality of pseudo_btf_id depends on (1)
      kallsyms and (2) the encoding of kernel global VARs in pahole, which
      should be available since pahole v1.18.
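      A usage sketch from the BPF C side (the variable name is hypothetical;
      libbpf turns the __ksym extern into the pseudo_btf_id ld_imm64
      described above):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      extern const int some_kernel_var __ksym;   /* hypothetical typed ksym */

      SEC("raw_tp/sys_enter")
      int read_ksym(void *ctx)
      {
              /* the access is validated against the variable's BTF type */
              return some_kernel_var;
      }

      char LICENSE[] SEC("license") = "GPL";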
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com
      4976b718
  10. 29 Sep 2020, 2 commits
  11. 18 Sep 2020, 3 commits
    • bpf: Add abnormal return checks. · 09b28d76
      Committed by Alexei Starovoitov
      LD_[ABS|IND] instructions may return from the function early. The
      bpf_tail_call pseudo instruction either falls through or returns. Allow
      them in subprograms only when the subprograms are BTF annotated and
      have scalar return types. Allow ld_abs and tail_call in the main
      program even if it calls into subprograms. In the past that was not ok
      to do for ld_abs, since it was JITed with a special exit sequence.
      Since bpf_gen_ld_abs() was introduced, ld_abs looks like a normal exit
      insn from the JIT's point of view, so it's safe to allow it in the main
      program.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09b28d76
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Committed by Maciej Fijalkowski
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation;
      2) it makes it possible to have tailcalls within a BPF subprogram.
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the stack layout has changed, tail call counter handling can no
      longer rely on popping it into rbx, as was done for the
      constant-prologue case, and later overwriting rbx with the actual value
      pushed to the stack. Therefore, let's use one of the registers (%rcx)
      that is considered volatile/caller-saved and pop the tail call counter
      value into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', in the poke descriptor,
      dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program. When the
      target program is not present, the BPF program will skip the pop
      instructions and the nop5 dedicated for 'jmpq $target'. An example of
      such a state, when only R6 of the callee-saved registers is used by the
      program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When the target program is inserted, the jump that was there to skip
      the pops/nop5 becomes the nop5, so the CPU will go through the pops and
      do the actual tailcall.
      
      One might ask why the pushes simply cannot come after the nop5.
      Consider the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
      and the jump target is not present. ctx is in the %rbx register, and
      the BPF subprogram that we will call into at ffffffffc037034c relies on
      it, e.g. it will pick up ctx from there. Such a code layout is
      therefore broken, as we would overwrite the contents of %rbx with the
      value that was pushed in the prologue. That is the reason for the
      'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of the
      tailcall target. In the case when the target program is not present,
      the CPU must not execute the pop instructions that precede the
      tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (it leads to incorrectness). The state transitions
      between tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to a jump, then we wait for an
         RCU grace period so that other programs finish their execution, and
         after that it is safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to run into an endless loop due to clearing the
      tailcall counter if, for example, we used a subprogram-based variant of
      the tailcall3-like program from the BPF selftests, meaning the tailcall
      would be present within the BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf
      progs and look for a tailcall presence throughout the whole chain. If
      we see a single tail call, then each node in this call chain needs to
      be marked as a subprog that can reach the tailcall. We later feed the
      JIT with this info and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ebf7d1f5
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Committed by Maciej Fijalkowski
      Protect against a potential stack overflow that might happen when
      bpf2bpf calls get combined with tailcalls. Limit the caller's stack
      depth for such a case to 256 so that the worst-case scenario results in
      an 8k stack size (32, which is the tailcall limit, * 256 = 8k).
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7f6e4312
  12. 23 Jun 2020, 1 commit
    • bpf: Support access to bpf map fields · 41c48f3a
      Committed by Andrey Ignatov
      There are multiple use cases where it's convenient to have access to
      bpf map fields, both `struct bpf_map` and map type specific structs
      such as `struct bpf_array`, `struct bpf_htab`, etc.
      
      For example while working with sock arrays it can be necessary to
      calculate the key based on map->max_entries (some_hash % max_entries).
      Currently this is solved by communicating max_entries via an
      "out-of-band" channel, e.g. via an additional map with a known key to
      get info about the target map. That works, but it is inconvenient and
      error-prone when working with many maps.
      
      In other cases necessary data is dynamic (i.e. unknown at loading time)
      and it's impossible to get it at all. For example while working with a
      hash table it can be convenient to know how much capacity is already
      used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).
      
      At the same time kernel knows this info and can provide it to bpf
      program.
      
      Fill this gap by adding support to access bpf map fields from bpf
      program for both `struct bpf_map` and map type specific fields.
      
      Support is implemented via btf_struct_access() so that a user can define
      their own `struct bpf_map` or map type specific struct in their program
      with only necessary fields and preserve_access_index attribute, cast a
      map to this struct and use a field.
      
      For example:
      
      	struct bpf_map {
      		__u32 max_entries;
      	} __attribute__((preserve_access_index));
      
      	struct bpf_array {
      		struct bpf_map map;
      		__u32 elem_size;
      	} __attribute__((preserve_access_index));
      
      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 4);
      		__type(key, __u32);
      		__type(value, __u32);
      	} m_array SEC(".maps");
      
      	SEC("cgroup_skb/egress")
      	int cg_skb(void *ctx)
      	{
      		struct bpf_array *array = (struct bpf_array *)&m_array;
      		struct bpf_map *map = (struct bpf_map *)&m_array;
      
      		/* .. use map->max_entries or array->map.max_entries .. */
      	}
      
      Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
      in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
      corresponding struct. Only reading from map fields is supported.
      
      For btf_struct_access() to work there should be a way to know btf id of
      a struct that corresponds to a map type. To get btf id there should be a
      way to get a stringified name of map-specific struct, such as
      "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
      `struct bpf_map_ops` to handle it:
      * .map_btf_name keeps a btf name of a struct returned by map_alloc();
      * .map_btf_id is used to cache btf id of that struct.
      
      To make btf id calculation cheaper, the ids are calculated once while
      preparing btf_vmlinux and cached the same way as is done for the btf_id
      field of `struct bpf_func_proto`.
      
      While calculating btf ids, struct names are NOT checked for collision.
      Collisions will be checked as a part of the work to prepare btf ids
      used in the verifier at compile time, which should land soon. The only
      known collision, for `struct bpf_htab` (kernel/bpf/hashtab.c vs
      net/core/sock_map.c), was fixed earlier.
      
      Both new fields .map_btf_name and .map_btf_id must be set for a map
      type for the feature to work. If neither is set for a map type, the
      verifier will return ENOTSUPP on an attempt to access map_ptr of the
      corresponding type. If just one of them is set, it's a verifier
      misconfiguration.
      
      Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
      BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
      supported separately.
      
      The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
      perfmon_capable() so that unpriv programs won't have access to bpf map
      fields.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
      41c48f3a
  13. 02 Jun 2020, 1 commit
    • bpf: Implement BPF ring buffer and verifier support for it · 457f4436
      Committed by Andrii Nakryiko
      This commit adds a new MPSC ring buffer implementation to the BPF
      ecosystem, which allows multiple CPUs to submit data to a single shared
      ring buffer. On the consumption side, only a single consumer is
      assumed.
      
      Motivation
      ----------
      There are two distinct motivations for this work that are not satisfied
      by the existing perf buffer and that prompted the creation of a new
      ring buffer implementation:
        - more efficient memory utilization by sharing the ring buffer across
        CPUs;
        - preserving the ordering of events that happen sequentially in time,
        even across multiple CPUs (e.g., fork/exec/exit events for a task).
      
      These two problems are independent, but the perf buffer fails to
      satisfy both. Both are a result of the choice to have a per-CPU perf
      ring buffer. Both can also be solved by having an MPSC implementation
      of the ring buffer. The ordering problem could technically be solved
      for the perf buffer with some in-kernel counting, but given that the
      first one requires an MPSC buffer, the same solution would solve the
      second problem automatically.
      
      Semantics and APIs
      ------------------
      A single ring buffer is presented to BPF programs as an instance of a
      BPF map of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were
      considered but ultimately rejected.
      
      One way would be to make BPF_MAP_TYPE_RINGBUF represent an array of
      ring buffers, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, but without
      enforcing the "same CPU only" rule. This would be a more familiar
      interface, compatible with existing perf buffer use in BPF, but it
      would fail if an application needed more advanced logic to look up a
      ring buffer by an arbitrary key. HASH_OF_MAPS addresses this with the
      current approach. Additionally, given the performance of BPF ringbuf,
      many use cases would just opt into a simple single ring buffer shared
      among all CPUs, for which the current approach would be overkill.
      
      Another approach could introduce a new concept, alongside the BPF map,
      to represent a generic "container" object, which doesn't necessarily
      have a key/value interface with lookup/update/delete operations. This
      approach would add a lot of extra infrastructure that has to be built
      for observability and verifier support. It would also add another
      concept that BPF developers would have to familiarize themselves with,
      new syntax in libbpf, etc., yet it would provide no real additional
      benefits over the approach of using a map. BPF_MAP_TYPE_RINGBUF doesn't
      support lookup/update/delete operations, but neither do a few other map
      types (e.g., queue and stack; array doesn't support delete, etc).
      
      The approach chosen has an advantage of re-using existing BPF map
      infrastructure (introspection APIs in the kernel, libbpf support, etc),
      being a familiar concept (no need to teach users a new type of object
      in a BPF program), and utilizing existing tooling (bpftool). For the
      common scenario of using a single ring buffer for all CPUs, it's as
      simple and straightforward as it would be with a dedicated "container"
      object. On the other hand, by being
      a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
      implement a wide variety of topologies, from one ring buffer for each CPU
      (e.g., as a replacement for perf buffer use cases), to a complicated
      application hashing/sharding of ring buffers (e.g., having a small pool of
      ring buffers with hashed task's tgid being a look up key to preserve order,
      but reduce contention).
      
      Key and value sizes are enforced to be zero. max_entries is used to specify
      the size of ring buffer and has to be a power of 2 value.
      
      There are a bunch of similarities between perf buffer
      (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
        - variable-length records;
        - if there is no more space left in ring buffer, reservation fails, no
          blocking;
        - memory-mappable data area for user-space applications for ease of
          consumption and high performance;
        - epoll notifications for new incoming data;
        - but still the ability to do busy polling for new data to achieve the
          lowest latency, if necessary.
      
      BPF ringbuf provides two sets of APIs to BPF programs:
        - bpf_ringbuf_output() allows *copying* data from one place to the
          ring buffer, similarly to bpf_perf_event_output();
        - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
          split the whole process into two steps. First, a fixed amount of
          space is reserved. If successful, a pointer to data inside the ring
          buffer data area is returned, which BPF programs can use similarly
          to data inside array/hash maps. Once ready, this piece of memory is
          either committed or discarded. Discard is similar to commit, but
          makes the consumer ignore the record.
      
      bpf_ringbuf_output() has the disadvantage of incurring an extra memory
      copy, because the record has to be prepared in some other place first.
      But it allows submitting records of a length that's not known to the
      verifier beforehand. It also closely matches bpf_perf_event_output(),
      so it will simplify migration significantly.
      
      bpf_ringbuf_reserve() avoids the extra memory copy by providing a
      memory pointer directly into ring buffer memory. In a lot of cases
      records are larger than the BPF stack space allows, so many programs
      have to use an extra per-CPU array as a temporary heap for preparing a
      sample. bpf_ringbuf_reserve() avoids this need completely. But in
      exchange, it only allows a known, constant size of memory to be
      reserved, so that the verifier can verify that the BPF program can't
      access memory outside its reserved record space. bpf_ringbuf_output(),
      while slightly slower due to the extra memory copy, covers some use
      cases that are not suitable for bpf_ringbuf_reserve().
      
      The difference between commit and discard is very small. Discard just marks
      a record as discarded, and such records are supposed to be ignored by consumer
      code. Discard is useful for some advanced use-cases, such as ensuring
      all-or-nothing multi-record submission, or emulating temporary malloc()/free()
      within single BPF program invocation.
      
      Each reserved record is tracked by the verifier through the existing
      reference-tracking logic, similar to socket ref-tracking. It is thus
      impossible to reserve a record but forget to submit (or discard) it.
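      A minimal reserve/submit usage sketch (event layout, names and attach
      point are made up; bpf_ringbuf_submit() is the commit step referred to
      above):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_RINGBUF);
              __uint(max_entries, 1 << 20);           /* ring size, power of 2 */
      } rb SEC(".maps");

      struct event {
              __u32 pid;
              __u32 cpu;
      };

      SEC("tp/sched/sched_switch")
      int handle_switch(void *ctx)
      {
              struct event *e;

              e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
              if (!e)                                 /* may fail when ring is full */
                      return 0;

              e->pid = bpf_get_current_pid_tgid() >> 32;
              e->cpu = bpf_get_smp_processor_id();
              bpf_ringbuf_submit(e, 0);               /* or bpf_ringbuf_discard(e, 0) */
              return 0;
      }

      char LICENSE[] SEC("license") = "GPL";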
      
      The bpf_ringbuf_query() helper allows querying various properties of
      the ring buffer. Currently 4 are supported:
        - BPF_RB_AVAIL_DATA returns the amount of unconsumed data in the ring buffer;
        - BPF_RB_RING_SIZE returns the size of the ring buffer;
        - BPF_RB_CONS_POS/BPF_RB_PROD_POS return the current logical position of the
          consumer/producer, respectively.
      The returned values are momentary snapshots of the ring buffer state
      and could be stale by the time the helper returns, so this should be
      used only for debugging/reporting reasons or for implementing various
      heuristics that take into account the highly changeable nature of some
      of those characteristics.
      
      One such heuristic might involve more fine-grained control over
      poll/epoll notifications about new data availability in the ring
      buffer. Together with the BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags
      for the output/commit/discard helpers, this gives a BPF program a high
      degree of control and, e.g., more efficient batched notifications. The
      default self-balancing strategy, though, should be adequate for most
      applications and will already work reliably and efficiently.
      
      Design and implementation
      -------------------------
      This reserve/commit schema allows a natural way for multiple producers,
      either on different CPUs or even on the same CPU/in the same BPF
      program, to reserve independent records and work with them without
      blocking other producers. This means that if a BPF program was
      interrupted by another BPF program sharing the same ring buffer, they
      will both get a record reserved (provided there is enough space left)
      and can work with it and submit it independently. This applies to NMI
      context as well, except that due to using a spinlock during
      reservation, in NMI context bpf_ringbuf_reserve() might fail to get the
      lock, in which case the reservation will fail even if the ring buffer
      is not full.
      
      The ring buffer itself internally is implemented as a power-of-2 sized
      circular buffer, with two logical and ever-increasing counters (which might
      wrap around on 32-bit architectures, that's not a problem):
        - consumer counter shows up to which logical position consumer consumed the
          data;
        - producer counter denotes amount of data reserved by all producers.
      
      Each time a record is reserved, the producer that "owns" the record
      will successfully advance the producer counter. At that point, the data
      is still not yet ready to be consumed, though. Each record has an
      8-byte header, which contains the length of the reserved record, as
      well as two extra bits: a busy bit to denote that the record is still
      being worked on, and a discard bit, which might be set at commit time
      if the record is discarded. In the latter case, the consumer is
      supposed to skip the record and move on to the next one. The record
      header also encodes the record's relative offset from the beginning of
      the ring buffer data area (in pages). This allows
      bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer
      to the record itself, without also requiring the pointer to the ring
      buffer. The ring buffer memory location is restored from the record
      metadata header. This significantly simplifies the verifier, as well as
      improving API usability.
      
      Producer counter increments are serialized under the spinlock, so there
      is a strict ordering between reservations. Commits, on the other hand,
      are completely lockless and independent. All records become available
      to the consumer in the order of reservations, but only after all
      previous records were already committed. It is thus possible for slow
      producers to temporarily hold back records that were reserved later.
      
      Reservation/commit/consumer protocol is verified by litmus tests in
      Documentation/litmus-test/bpf-rb.
      
      One interesting implementation detail that significantly simplifies
      (and thus speeds up as well) the implementation of both producers and
      consumers is how the data area is mapped twice, contiguously
      back-to-back, in virtual memory. This makes it unnecessary to take any
      special measures for samples that have to wrap around at the end of the
      circular buffer data area, because the next page after the last data
      page is the first data page again, and thus the sample still appears
      completely contiguous in virtual memory. See the comment and a simple
      ASCII diagram showing this visually in bpf_ringbuf_area_alloc().
      
      Another feature that distinguishes BPF ringbuf from the perf ring
      buffer is self-pacing notifications of new data availability. The
      bpf_ringbuf_commit() implementation will send a notification of a new
      record being available after commit only if the consumer has already
      caught up right up to the record being committed. If not, the consumer
      still has to catch up and thus will see the new data anyway without
      needing an extra poll notification. Benchmarks (see
      tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that this
      allows achieving very high throughput without having to resort to
      tricks like "notify only every Nth sample", which are necessary with
      the perf buffer. For extreme cases, when a BPF program wants more
      manual control of notifications, the commit/discard/output helpers
      accept BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full
      control over notifications of data availability, but require extra
      caution and diligence in using this API.
      
      Comparison to alternatives
      --------------------------
      Before implementing the BPF ring buffer from scratch, existing
      alternatives in the kernel were evaluated, but they didn't seem to meet
      the needs. They largely fell into a few categories:
        - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
          outlined above (ordering and memory consumption);
        - linked list-based implementations; while some were multi-producer designs,
          consuming these from user-space would be very complicated and most
          probably not performant; memory-mapping contiguous piece of memory is
          simpler and more performant for user-space consumers;
        - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
          SPSC queue into MPSC w/ lock would have subpar performance compared to
          locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
          elements would be too limiting for BPF programs, given existing BPF
          programs heavily rely on variable-sized perf buffer already;
        - specialized implementations (like a new printk ring buffer, [0]) with lots
          of printk-specific limitations and implications, that didn't seem to fit
          well for intended use with BPF programs.
      
        [0] https://lwn.net/Articles/779550/
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      457f4436
  14. 15 May 2020, 1 commit
    • bpf: Implement CAP_BPF · 2c78ee89
      Committed by Alexei Starovoitov
      Implement permissions as stated in uapi/linux/capability.h. In order to
      do that, the verifier allow_ptr_leaks flag is split into four flags,
      set as:
        env->allow_ptr_leaks = bpf_allow_ptr_leaks();
        env->bypass_spec_v1 = bpf_bypass_spec_v1();
        env->bypass_spec_v4 = bpf_bypass_spec_v4();
        env->bpf_capable = bpf_capable();
      
      The first three are currently equivalent to perfmon_capable(), since
      leaking kernel pointers and reading kernel memory via side channel
      attacks is roughly equivalent to reading kernel memory with
      cap_perfmon.
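      A simplified sketch of what those predicates boil down to per the
      description above (not the verbatim kernel code):

      #include <linux/capability.h>

      static inline bool bpf_allow_ptr_leaks_sketch(void)
      {
              return perfmon_capable();
      }

      static inline bool bpf_bypass_spec_v1_sketch(void)
      {
              return perfmon_capable();
      }

      static inline bool bpf_bypass_spec_v4_sketch(void)
      {
              return perfmon_capable();
      }

      static inline bool bpf_capable_sketch(void)
      {
              return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
      }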
      
      'bpf_capable' enables bounded loops, precision tracking, bpf-to-bpf
      calls and other verifier features. 'allow_ptr_leaks' enables ptr leaks,
      ptr conversions, and subtraction of pointers. 'bypass_spec_v1' disables
      speculative analysis in the verifier and run-time mitigations in bpf
      arrays, and enables indirect variable access in bpf programs.
      'bypass_spec_v4' disables emission of sanitation code by the verifier.
      
      That means that a networking BPF program loaded with CAP_BPF +
      CAP_NET_ADMIN will have speculative checks done by the verifier and
      other spectre mitigations applied. Such a networking BPF program will
      not be able to leak kernel pointers and will not be able to access
      arbitrary kernel memory.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
      2c78ee89
  15. 31 Mar 2020, 1 commit
    • bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      Committed by John Fastabend
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even
      though the program should be verifiable. BPF code that hits this can
      work around it by changing int variables to 64-bit types, marking
      variables volatile, etc. But this is all very ugly, so it would be
      better to avoid these tricks.
      
      But the main reason to address this now is that do_refine_retval_range()
      was assuming return values could not be negative. Once we fix this,
      code that was previously working will no longer work. See the
      do_refine_retval_range() patch for details. And we don't want to
      suddenly cause programs that used to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is that without the w* bounds we are forced to
      use the worst-case bound of [0, U32_MAX]. To resolve this type of case,
      where jmp32 creates 32-bit bounds that diverge from the 64-bit bounds,
      we add explicit 32-bit register bounds s32_{min|max}_value and
      u32_{min|max}_value. Then, when the branch_taken logic creates new
      bounds, we can track the 32-bit bounds explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally there is the question of how and when to merge bounds. The
      cases are enumerated here:
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
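      A toy model of that zero-extension step (illustrative only; the real
      zext_32_to_64 also truncates var_off):

      #include <stdint.h>

      struct bounds {                          /* toy per-register bounds */
              uint64_t umin, umax;
              int64_t  smin, smax;
              uint32_t u32_min, u32_max;
              int32_t  s32_min, s32_max;
      };

      /* After a 32-bit op the upper 32 bits are zero, so the 64-bit unsigned
       * bounds equal the 32-bit ones; the signed 64-bit bounds can be copied
       * only when the 32-bit signed range is non-negative, otherwise fall
       * back to the full zero-extended range [0, U32_MAX].
       */
      static void zext_32_to_64_sketch(struct bounds *r)
      {
              r->umin = r->u32_min;
              r->umax = r->u32_max;
              if (r->s32_min >= 0) {
                      r->smin = r->s32_min;
                      r->smax = r->s32_max;
              } else {
                      r->smin = 0;
                      r->smax = UINT32_MAX;
              }
      }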
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
      3f50f132
  16. 11 Jan 2020, 1 commit
    • bpf: Introduce function-by-function verification · 51c39bb1
      Committed by Alexei Starovoitov
      New llvm, and old llvm with libbpf help, produce BTF that distinguishes
      global and static functions. Unlike the arguments of a static function,
      the arguments of global functions cannot be removed or optimized away
      by llvm. The compiler has to use exactly the arguments specified in the
      function prototype. The argument type information allows the verifier
      to validate each global function independently. For now the only
      supported argument types are pointer to context and scalars. In the
      future pointers to structures, sizes, and pointers to packet data can
      be supported as well. Consider the following example:
      
      static int f1(int ...)
      {
        ...
      }
      
      int f3(int b);
      
      int f2(int a)
      {
        f1(a) + f3(a);
      }
      
      int f3(int b)
      {
        ...
      }
      
      int main(...)
      {
        f1(...) + f2(...) + f3(...);
      }
      
      The verifier will start its safety checks from the first global function f2().
      It will recursively descend into f1() because it's static. Then it will check
      that arguments match for the f3() invocation inside f2(). It will not descend
      into f3(). It will finish f2() that has to be successfully verified for all
      possible values of 'a'. Then it will proceed with f3(). That function also has
      to be safe for all possible values of 'b'. Then it will start subprog 0 (which
      is main() function). It will recursively descend into f1() and will skip full
      check of f2() and f3(), since they are global. The order of processing global
      functions doesn't affect safety, since all global functions must be proven safe
      based on their arguments only.
      
      Such function-by-function verification can drastically improve the
      speed of verification and reduce its complexity.
      
      Note that the stack limit of 512 still applies to the call chain
      regardless of whether functions were static or global. The nested level
      of 8 also still applies. The same recursion prevention checks are in
      place as well.
      
      The type information and the static/global kind are preserved after
      verification, hence in the above example the global functions f2() and
      f3() can be replaced later by equivalent functions with the same types
      that are loaded and verified later without affecting the safety of this
      main() program. Such replacement (re-linking) of global functions is a
      subject of future patches.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
      51c39bb1
  17. 25 Nov 2019, 1 commit
    • bpf: Constant map key tracking for prog array pokes · d2e4c1e6
      Committed by Daniel Borkmann
      Add tracking of constant keys into tail call maps. The signature of
      bpf_tail_call_proto is that arg1 is ctx, arg2 is the map pointer and
      arg3 is the index key. The direct call approach for tail calls can be
      enabled if the verifier has asserted that for all branches leading to
      the tail call helper invocation, the map pointer and index key were
      both constant and the same.
      
      Tracking of map pointers we already do from prior work via c93552c4
      ("bpf: properly enforce index mask to prevent out-of-bounds speculation")
      and 09772d92 ("bpf: avoid retpoline for lookup/update/delete calls
      on maps").
      
      Given the tail call map index key is not on the stack but directly in a
      register, we can add a similar tracking approach and later, in
      fixup_bpf_calls(), add a poke descriptor to the prog's poke_tab with
      the relevant information for the JITing phase.
      
      We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
      instruction in order to point into the prog's poke_tab, and keep
      insn->imm as 0 as an indicator that the current indirect tail call
      emission must be used. Note that publishing to the tracker must happen
      at the end of fixup_bpf_calls(), since adding elements to the poke_tab
      reallocates its memory, so we need to wait until it is in its final
      state.
      
      Future work can generalize this and add a similar approach to optimize
      plain array map lookups. The difference there is that we need to look
      into the key value that sits on the stack. For clarity, in
      bpf_insn_aux_data, map_state has been renamed into map_ptr_state, so we
      get map_{ptr,key}_state as trackers.
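      An illustrative BPF C snippet of the pattern that now qualifies for the
      direct-jump poke (map and program names are made up):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
              __uint(max_entries, 4);
              __type(key, __u32);
              __type(value, __u32);
      } jmp_table SEC(".maps");

      SEC("tc")
      int classifier(struct __sk_buff *skb)
      {
              /* constant map pointer + constant index 2 -> direct jump poke */
              bpf_tail_call(skb, &jmp_table, 2);
              return 0;               /* fallthrough if the slot is empty */
      }

      char LICENSE[] SEC("license") = "GPL";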
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
      d2e4c1e6
  18. 16 Nov 2019, 1 commit
  19. 17 Oct 2019, 2 commits
    • bpf: Implement accurate raw_tp context access via BTF · 9e15db66
      Committed by Alexei Starovoitov
      libbpf analyzes the bpf C program, searches the in-kernel BTF for the
      given type name, and stores it into expected_attach_type.
      The kernel verifier expects this btf_id to point to something like:
      typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
      which represents the signature of the raw_tracepoint "kfree_skb".
      
      Then btf_ctx_access() matches a ctx+0 access in the bpf program with
      the 'skb' argument and a ctx+8 access with the 'loc' argument of the
      "kfree_skb" tracepoint. In the first case it passes the btf_id of
      'struct sk_buff *' back to the verifier core, and 'void *' in the
      second case.
      
      Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
      Like PTR_TO_SOCKET points to 'struct bpf_sock',
      PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
      PTR_TO_BTF_ID points to in-kernel structs.
      If 1234 is the btf_id of 'struct sk_buff' in vmlinux's BTF,
      then PTR_TO_BTF_ID#1234 points to one of the in-kernel skbs.
      
      When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)(r1 + 32)),
      btf_struct_access() checks which field of 'struct sk_buff' is at offset
      32. It checks that the size of the access matches the type definition
      of the field and continues to track the dereferenced type.
      If that field was a pointer to 'struct net_device', then r2's type will
      be PTR_TO_BTF_ID#456, where 456 is the btf_id of 'struct net_device' in
      vmlinux's BTF.
      
      Such verifier analysis prevents "cheating" in a BPF C program.
      The program cannot cast an arbitrary pointer to 'struct sk_buff *' and
      access it. The C compiler would allow the type cast, of course, but the
      verifier will notice the type mismatch based on the BPF assembly and
      the in-kernel BTF.
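      A usage sketch from the program side (it mirrors the style of the
      kfree_skb selftest rather than this patch; the ctx struct mirrors the
      btf_trace_kfree_skb arguments):

      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      struct trace_kfree_skb_ctx {
              struct sk_buff *skb;
              void *location;
      };

      SEC("tp_btf/kfree_skb")
      int trace_kfree_skb(struct trace_kfree_skb_ctx *ctx)
      {
              struct sk_buff *skb = ctx->skb;

              /* direct field read through PTR_TO_BTF_ID, checked against BTF */
              bpf_printk("kfree_skb len=%u", skb->len);
              return 0;
      }

      char LICENSE[] SEC("license") = "GPL";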
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
      9e15db66
    • bpf: Process in-kernel BTF · 8580ac94
      Committed by Alexei Starovoitov
      If in-kernel BTF exists, parse it and prepare 'struct btf *btf_vmlinux'
      for further use by the verifier.
      In-kernel BTF is trusted just like kallsyms and other build artifacts
      embedded into vmlinux.
      Still, run this BTF image through the BTF verifier to make sure that it
      is valid and wasn't mangled during the build.
      If the in-kernel BTF is incorrect, it means gcc, pahole, or the kernel
      is buggy. In such a case, disallow loading BPF programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
      8580ac94
  20. 28 Aug 2019, 1 commit
  21. 19 Jun 2019, 2 commits
    • bpf: precise scalar_value tracking · b5dc0163
      Committed by Alexei Starovoitov
      Introduce precision tracking logic that
      helps cilium programs the most:
                         old clang  old clang      new clang  new clang
                                    (all patches)             (all patches)
      bpf_lb-DLB_L3.o      1838     2283         1923       1863
      bpf_lb-DLB_L4.o      3218     2657         3077       2468
      bpf_lb-DUNKNOWN.o    1064     545          1062       544
      bpf_lxc-DDROP_ALL.o  26935    23045        166729     22629
      bpf_lxc-DUNKNOWN.o   34439    35240        174607     28805
      bpf_netdev.o         9721     8753         8407       6801
      bpf_overlay.o        6184     7901         5420       4754
      bpf_lxc_jit.o        39389    50925        39389      50925
      
      Consider code:
      654: (85) call bpf_get_hash_recalc#34
      655: (bf) r7 = r0
      656: (15) if r8 == 0x0 goto pc+29
      657: (bf) r2 = r10
      658: (07) r2 += -48
      659: (18) r1 = 0xffff8881e41e1b00
      661: (85) call bpf_map_lookup_elem#1
      662: (15) if r0 == 0x0 goto pc+23
      663: (69) r1 = *(u16 *)(r0 +0)
      664: (15) if r1 == 0x0 goto pc+21
      665: (bf) r8 = r7
      666: (57) r8 &= 65535
      667: (bf) r2 = r8
      668: (3f) r2 /= r1
      669: (2f) r2 *= r1
      670: (bf) r1 = r8
      671: (1f) r1 -= r2
      672: (57) r1 &= 255
      673: (25) if r1 > 0x1e goto pc+12
       R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
      674: (67) r1 <<= 1
      675: (0f) r0 += r1
      
      At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
      R1 has to be precise for later operations on R0 to be validated properly.
      
      The verifier will backtrack the above code in the following way:
      last_idx 675 first_idx 664
      regs=2 stack=0 before 675: (0f) r0 += r1         // started backtracking R1 regs=2 is a bitmask
      regs=2 stack=0 before 674: (67) r1 <<= 1
      regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
      regs=2 stack=0 before 672: (57) r1 &= 255
      regs=2 stack=0 before 671: (1f) r1 -= r2         // now both R1 and R2 have to be precise -> regs=6 mask
      regs=6 stack=0 before 670: (bf) r1 = r8          // after this insn R8 and R2 have to be precise
      regs=104 stack=0 before 669: (2f) r2 *= r1       // after this one R8, R2, and R1
      regs=106 stack=0 before 668: (3f) r2 /= r1
      regs=106 stack=0 before 667: (bf) r2 = r8
      regs=102 stack=0 before 666: (57) r8 &= 65535
      regs=102 stack=0 before 665: (bf) r8 = r7
      regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
       // this is the end of the verifier state. The following regs will be marked precise:
       R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
      parent didn't have regs=82 stack=0 marks         // so backtracking continues into parent state
      last_idx 663 first_idx 655
      regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0)   // R1 was assigned no need to track it further
      regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23    // keep tracking R7
      regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1  // keep tracking R7
      regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
      regs=80 stack=0 before 658: (07) r2 += -48
      regs=80 stack=0 before 657: (bf) r2 = r10
      regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
      regs=80 stack=0 before 655: (bf) r7 = r0                // here the assignment into R7
       // mark R0 to be precise:
       R0_rw=invP(id=0)
      parent didn't have regs=1 stack=0 marks                 // regs=1 -> tracking R0
      last_idx 654 first_idx 644
      regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
        // nothing further to backtrack
      
      Two scalar registers not marked precise are equivalent from the state pruning
      point of view. More details are in the patch comments.
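
      A C-level sketch of the kind of pattern shown above (names are illustrative,
      not taken from the cilium sources):

        char *val = bpf_map_lookup_elem(&array_map, &key);  /* value_size = 64 */
        __u32 idx;

        if (!val)
                return 0;
        idx = bpf_get_hash_recalc(skb) & 0xff;   /* unknown, then bounded scalar */
        if (idx > 0x1e)
                return 0;
        idx <<= 1;                               /* still <= 60, within the value */
        /* 'idx' feeds map-value pointer arithmetic, so the verifier backtracks
         * and marks it (and the registers it was derived from) precise */
        return val[idx];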
      
      Precision tracking does not support bpf2bpf calls yet and is enabled for root only.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b5dc0163
    • A
      bpf: introduce bounded loops · 2589726d
      Authored by Alexei Starovoitov
      Allow the verifier to validate loops by simulating their execution.
      Existing programs have relied on '#pragma unroll' to have the compiler
      unroll loops. Instead, let the verifier simulate all iterations
      of the loop.
      In order to do that, introduce a parentage chain of bpf_verifier_state and
      a 'branches' counter for the number of branches left to explore.
      See the more detailed algorithm description in bpf_verifier.h.

      This algorithm borrows the key idea from Edward Cree's approach:
      https://patchwork.ozlabs.org/patch/877222/
      Additional state pruning heuristics make such a brute force loop walk
      practical even for large loops.
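
      A minimal sketch of a loop the verifier can now validate without
      '#pragma unroll' (BPF C, assuming the usual <linux/bpf.h> and
      <bpf/bpf_helpers.h> includes; names are illustrative):

        SEC("xdp")
        int sum_first_bytes(struct xdp_md *ctx)
        {
                __u8 *data = (void *)(long)ctx->data;
                __u8 *data_end = (void *)(long)ctx->data_end;
                int i, sum = 0;

                /* 'i' is bounded, so the verifier simulates every iteration
                 * and proves that all packet accesses stay in range */
                for (i = 0; i < 64; i++) {
                        if (data + i + 1 > data_end)
                                break;
                        sum += data[i];
                }
                return sum & 1 ? XDP_DROP : XDP_PASS;
        }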
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2589726d
  22. 31 May, 2019 1 commit
  23. 25 May, 2019 1 commit
    • J
      bpf: verifier: mark verified-insn with sub-register zext flag · 5327ed3d
      Authored by Jiong Wang
      The eBPF ISA specification requires the high 32 bits to be cleared when the low
      32-bit sub-register is written. This applies to the destination register of ALU32
      instructions etc. JIT back-ends must guarantee this semantic when doing code-gen.
      The x86_64 and AArch64 ISAs have the same semantics, so the corresponding JIT
      back-ends don't need to do extra work.
      
      However, 32-bit arches (arm, x86, nfp, etc.) and some other 64-bit arches
      (PowerPC, SPARC, etc.) need to do explicit zero extension to meet this
      requirement, otherwise code like the following will fail.
      
        u64_value = (u64) u32_value
        ... other uses of u64_value
      
      This is because the compiler can exploit the semantic described above and
      omit those zero extensions when extending u32_value to u64_value. These JIT
      back-ends are then expected to guarantee the semantic by inserting extra zero
      extensions, which could be a significant increase in code size.
      Some benchmarks show there could be ~40% sub-register writes out of total
      insns, meaning at least ~40% extra code-gen.
      
      One observation is that these extra zero extensions are not always necessary.
      Take the above code snippet for example: it is possible u32_value is never
      cast into a u64, in which case the high 32 bits of u32_value can be
      ignored and the extra zero extension can be eliminated.
      
      This patch implements this idea: insns defining sub-registers are
      marked when the high 32 bits of the defined sub-register matter. For
      those unmarked insns, it is safe to eliminate the high 32-bit clearance
      for them.
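
      A C-level sketch of the two cases (illustrative, reusing the u32/u64 types
      from the snippet above):

        u32 x = *p32;        /* 32-bit write: defines sub-register 'x'          */
        u64 y = (u64)x;      /* full 64-bit read: the insn defining 'x' gets
                              * marked as needing its high 32 bits cleared      */

        u32 a = *q32;        /* 32-bit write: defines sub-register 'a'          */
        u32 b = a + 1;       /* only 32-bit reads of 'a': its defining insn
                              * stays unmarked and the zero extension can be
                              * eliminated on 32-bit JITs                       */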
      
      Algo:
       - Split read flags into READ32 and READ64.
      
       - Record index of insn that does sub-register write. Keep the index inside
         reg state and update it during verifier insn walking.
      
       - A full register read on a sub-register marks its definition insn as
         needing zero extension on dst register.
      
         A new sub-register write overrides the old one.
      
       - When propagating read64 during path pruning, also mark any insn defining
         a sub-register that is read in the pruned path as full-register.
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NJiong Wang <jiong.wang@netronome.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      5327ed3d
  24. 24 May, 2019 2 commits
    • A
      bpf: convert explored_states to hash table · dc2a4ebc
      Authored by Alexei Starovoitov
      All prune points inside a callee BPF function will most likely have
      different callsites. For example, if function foo() is called from
      two callsites, half of the explored states at all prune points in foo()
      will be useless for subsequent walking from one of those callsites.
      Fortunately, the explored_states pruning heuristics keep the number of states
      per prune point small, but walking these states is still a waste of cpu
      time when the callsite of the current state is different from the callsite
      of the explored state.
      
      To improve the pruning logic, convert explored_states into a hash table and
      use a simple insn_idx ^ callsite hash to select the hash bucket.
      This optimization has no effect on programs without bpf2bpf calls
      and drastically improves programs with calls.
      In the latter case it reduces total memory consumption in 1M-scale tests
      by almost 3 times (peak_states drops from 5752 to 2016).
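
      Conceptually, the bucket selection looks like the following sketch (names
      are illustrative, not the exact in-tree helpers):

        static unsigned int state_bucket(unsigned int insn_idx,
                                         unsigned int callsite,
                                         unsigned int state_htab_size)
        {
                /* states stored from a different callsite of the same insn land
                 * in a different bucket, so they are skipped without comparison */
                return (insn_idx ^ callsite) % state_htab_size;
        }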
      
      Care should be taken when comparing the states for equivalency.
      Since the same hash bucket can now contain states with different indices,
      the insn_idx has to be part of verifier_state and compared.
      
      Different hash table sizes and different hash functions were explored,
      but the results were not significantly better vs this patch.
      They can be improved in the future.
      
      The hit/miss heuristic does not count an index miscompare as a miss.
      Otherwise verifier stats become unstable when experimenting
      with different hash functions.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      dc2a4ebc
    • A
      bpf: split explored_states · a8f500af
      Authored by Alexei Starovoitov
      Split explored_states into a prune_point boolean mark
      and a linked list of explored states.
      This removes the STATE_LIST_MARK hack and allows marks to be separate from states.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a8f500af
  25. 23 Apr, 2019 1 commit
  26. 10 Apr, 2019 1 commit
    • D
      bpf: implement lookup-free direct value access for maps · d8eca5bb
      Authored by Daniel Borkmann
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for the ldimm64 instruction that indicates
      that the first part of the double insn's imm field holds a
      file descriptor, which the verifier then replaces with the full 64-bit
      address of the map spread across both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insn's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset, for maps that support this
      operation. Currently supported is the array map with a single entry.
      It is possible to support more than just a single map element by
      reusing both 16-bit off fields of the insns as a map index, so a
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of a concrete use case, but
      could easily be done in the future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      BPF_PSEUDO_MAP_VALUE is a distinct flag because otherwise, with
      BPF_PSEUDO_MAP_FD, we could not distinguish a load of the
      map pointer from a load of the map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding of the off field to differentiate a
      regular map pointer from a map value pointer would add unnecessary
      complexity and raise the barrier for debuggability, and is thus less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations, since the maximum
      possible value size is in the u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call,
      which needs to prepare registers according to the calling convention,
      etc., without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
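
      For illustration (not part of this patch): with the loader support added in
      the follow-up libbpf commits, a global variable ends up in a single-entry
      array map and its address is emitted as one BPF_PSEUDO_MAP_VALUE ldimm64.
      A hedged BPF C sketch, with illustrative names:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        __u64 pkt_count = 1;              /* placed in .data, offset 0 of the value */

        SEC("xdp")
        int count_pkts(struct xdp_md *ctx)
        {
                /* &pkt_count compiles to a single ldimm64: no lookup helper,
                 * no NULL test, no extra offset addition at runtime */
                __sync_fetch_and_add(&pkt_count, 1);
                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";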
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8eca5bb
  27. 04 Apr, 2019 2 commits
    • A
      bpf: improve verification speed by droping states · 9f4686c4
      Authored by Alexei Starovoitov
      Branch instructions, branch targets and calls in a bpf program are
      the places where the verifier remembers states that led to successful
      verification of the program.
      These states are used to prune brute force program analysis.
      For unprivileged programs there is a limit of 64 states per such
      'branching' instructions (maximum length is tracked by max_states_per_insn
      counter introduced in the previous patch).
      Simply reducing this threshold to 32 or lower increases insn_processed
      metric to the point that small valid programs get rejected.
      For root programs there is no limit and cilium programs can have
      max_states_per_insn to be 100 or higher.
      Walking 100+ states multiplied by number of 'branching' insns during
      verification consumes significant amount of cpu time.
      It turned out a simple LRU-like mechanism can be used to remove states
      that are unlikely to be helpful in future search pruning.
      This patch introduces hit_cnt and miss_cnt counters:
      hit_cnt - this many times this state successfully pruned the search
      miss_cnt - this many times this state was not equivalent to other states
      (and those other states were added to the state list)
      
      The heuristic introduced in this patch is:
      if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
        /* drop this state from future considerations */
      
      Higher numbers increase max_states_per_insn (allow more states to be
      considered for pruning) and slow down verification, but do not meaningfully
      reduce the insn_processed metric.
      Lower numbers drop too many states and insn_processed increases too much.
      Many different formulas were considered.
      This one is simple and works well enough in practice.
      (the analysis was done on selftests/progs/* and on cilium programs)
      
      The end result is that this heuristic improves verification speed by 10 times.
      Large synthetic programs that used to take a second or more now take
      1/10 of a second.
      In cases where max_states_per_insn used to be 100 or more, now it's ~10.
      
      There is a slight increase in insn_processed for cilium progs:
                             before   after
      bpf_lb-DLB_L3.o 	1831	1838
      bpf_lb-DLB_L4.o 	3029	3218
      bpf_lb-DUNKNOWN.o 	1064	1064
      bpf_lxc-DDROP_ALL.o	26309	26935
      bpf_lxc-DUNKNOWN.o	33517	34439
      bpf_netdev.o		9713	9721
      bpf_overlay.o		6184	6184
      bpf_lxc_jit.o		37335	39389
      And 2-3 times improvement in the verification speed.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9f4686c4
    • A
      bpf: add verifier stats and log_level bit 2 · 06ee7115
      Authored by Alexei Starovoitov
      In order to understand the verifier bottlenecks add various stats
      and extend log_level:
      log_level 1 and 2 are kept as-is:
      bit 0 - level=1 - print every insn and verifier state at branch points
      bit 1 - level=2 - print every insn and verifier state at every insn
      bit 2 - level=4 - print verifier error and stats at the end of verification
      
      When the verifier rejects a program, libbpf tries to load the program twice:
      once with log_level=0 (no messages, only the error code is reported to user space)
      and a second time with log_level=1 to tell the user why the verifier rejected it.
      
      With the introduction of bit 2 (level=4), libbpf can choose to always use that
      level and load programs once, since verification speed is not affected and
      in case of error the verbose message will be available.
      
      Note that the verifier stats are not part of the uapi, just like all other
      verbose messages. They're expected to change in the future.
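
      For reference, loading once with bit 2 set, using today's libbpf API (which
      postdates this patch; buffer size and names are illustrative):

        char log_buf[64 * 1024];
        LIBBPF_OPTS(bpf_prog_load_opts, opts,
                .log_buf = log_buf,
                .log_size = sizeof(log_buf),
                .log_level = 4,         /* bit 2: stats, plus the error if any */
        );
        int fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "prog", "GPL",
                               insns, insn_cnt, &opts);

        if (fd < 0)
                fprintf(stderr, "%s\n", log_buf);  /* verbose reason on failure */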
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      06ee7115
  28. 14 Mar, 2019 1 commit
    • M
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Authored by Martin KaFai Lau
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is that the verifier did not scrub the register state of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried: setting "reg0->id = reg1->id" when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overloading the NULL marking and ref
      tracking purposes in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which has a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to prevent these helpers from returning
      another ptr. This will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
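
      For reference, the intended usage pattern after this fix, as a hedged BPF C
      sketch (program type, section name and tuple values are chosen for
      illustration only):

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        SEC("tc")
        int snd_cwnd_probe(struct __sk_buff *skb)
        {
                struct bpf_sock_tuple tuple = {
                        .ipv4.daddr = bpf_htonl(0x7f000001),
                        .ipv4.dport = bpf_htons(4242),
                };
                struct bpf_sock *sk;
                struct bpf_tcp_sock *tp;
                __u32 snd_cwnd;

                sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                       BPF_F_CURRENT_NETNS, 0);
                if (!sk)
                        return TC_ACT_OK;
                tp = bpf_tcp_sock(sk);
                if (!tp) {
                        bpf_sk_release(sk);
                        return TC_ACT_OK;
                }
                snd_cwnd = tp->snd_cwnd;   /* read while the reference is held */
                bpf_sk_release(sk);        /* after this, any use of tp (or sk) is
                                            * rejected: their reg states are scrubbed
                                            * via the new ref_obj_id */
                return snd_cwnd ? TC_ACT_OK : TC_ACT_SHOT;
        }

        char _license[] SEC("license") = "GPL";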
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - check_func_arg() can only allow one refcounted arg.  This is
        guaranteed by check_refcount_ok(), which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if more than one
        refcounted arg is found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by an acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: NLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1b986589
  29. 02 Feb, 2019 1 commit
    • A
      bpf: introduce bpf_spin_lock · d83525ca
      Authored by Alexei Starovoitov
      Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
      bpf programs serialize access to other variables.
      
      Example:
      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock;
      };
      struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
      if (val) {
          bpf_spin_lock(&val->lock);
          val->cnt++;
          bpf_spin_unlock(&val->lock);
      }
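
      With today's libbpf conventions (which postdate this patch), the map for this
      example could be declared with the mandatory BTF like the following sketch:

      struct {
              __uint(type, BPF_MAP_TYPE_HASH);
              __uint(max_entries, 1024);
              __type(key, int);
              __type(value, struct hash_elem);
      } hash_map SEC(".maps");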
      
      Restrictions and safety checks:
      - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
      - BTF description of the map is mandatory for safety analysis.
      - bpf program can take one bpf_spin_lock at a time, since two or more can
        cause deadlocks.
      - only one 'struct bpf_spin_lock' is allowed per map element.
        It drastically simplifies implementation yet allows bpf program to use
        any number of bpf_spin_locks.
      - while bpf_spin_lock is held, calls (either bpf2bpf or helper calls) are not allowed.
      - bpf program must bpf_spin_unlock() before return.
      - bpf program can access 'struct bpf_spin_lock' only via
        bpf_spin_lock()/bpf_spin_unlock() helpers.
      - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
      - to use bpf_spin_lock() helper the BTF description of map value must be
        a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
        Nested lock inside another struct is not allowed.
      - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
      - syscall map_update and program map_update do not update bpf_spin_lock field.
      - bpf_spin_lock cannot be on the stack or inside networking packet.
        bpf_spin_lock can only be inside HASH or ARRAY map value.
      - bpf_spin_lock is available to root only and to all program types.
      - bpf_spin_lock is not allowed in inner maps of map-in-map.
      - ld_abs is not allowed inside spin_lock-ed region.
      - tracing progs and socket filter progs cannot use bpf_spin_lock due to
        insufficient preemption checks
      
      Implementation details:
      - cgroup-bpf class of programs can nest with xdp/tc programs.
        Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
        Other solutions to avoid nested bpf_spin_lock are possible.
        Like making sure that all networking progs run with softirq disabled.
        spin_lock_irqsave is the simplest and doesn't add overhead to the
        programs that don't use it.
      - arch_spinlock_t is used when its implemented as queued_spin_lock
      - archs can force their own arch_spinlock_t
      - on architectures where queued_spin_lock is not available and
        sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
      - presence of bpf_spin_lock inside map value could have been indicated via
        extra flag during map_create, but specifying it via BTF is cleaner.
        It provides introspection for map key/value and reduces user mistakes.
      
      Next steps:
      - allow bpf_spin_lock in other map types (like cgroup local storage)
      - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
        to request kernel to grab bpf_spin_lock before rewriting the value.
        That will serialize access to map elements.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d83525ca
  30. 24 Jan, 2019 2 commits
  31. 06 Jan, 2019 1 commit
  32. 03 Jan, 2019 1 commit
    • D
      bpf: prevent out of bounds speculation on pointer arithmetic · 979d63d5
      Authored by Daniel Borkmann
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop the CPU from speculating out-of-bounds memory access:
      while b2157399 only focused on masking array map access
      for unprivileged users for tail calls and data access, such
      that the user-provided index gets sanitized from the BPF program
      and syscall side, there is still a more generic form, affecting
      most maps that hold user data, when BPF programs perform dynamic
      map access with unknown scalars or "slow" known scalars as the
      access offset, for example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both are accessible to unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
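
      In C terms, both masking sequences compute the same thing; a conceptual
      model of the emitted instructions (sketch only, not code from this patch):

        static __u64 sanitize_off(__u64 val, __u64 limit)
        {
                __u64 m = limit - val;        /* MSB set if val > limit             */

                m |= val;                     /* MSB also set if val has MSB set    */
                m = -m;                       /* MSB now set only when val in range */
                m = (__u64)((__s64)m >> 63);  /* all-ones mask iff val is in range  */
                return val & m;               /* keep val in range, else force 0    */
        }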
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that the test_verifier
      and test_progs suites with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 run successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      979d63d5