1. 31 March 2020 (1 commit)
    • bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      Authored by John Fastabend
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even though
      the program should be verifiable. BPF programs that hit this can work around
      it by changing int variables to 64-bit types, marking variables volatile,
      etc. But this is all very ugly, so it would be better to avoid these tricks.
      
      But the main reason to address this now is that do_refine_retval_range() was
      assuming return values could not be negative. Once we fix that, code that
      was previously working will no longer work. See the do_refine_retval_range()
      patch for details. And we don't want to suddenly cause programs that used
      to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is that without the w* bounds we are forced to use
      the worst-case bound of [0, U32_MAX]. To resolve this type of case, where
      jmp32 creates 32-bit bounds that diverge from the 64-bit bounds, we add
      explicit 32-bit register bounds s32_{min|max}_value and u32_{min|max}_value.
      Then, from the branch_taken logic that creates new bounds, we can track
      32-bit bounds explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally there is the question of how and when to merge bounds. The cases
      are enumerated here:
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
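      
      To make the 32-bit-into-64-bit direction above concrete, here is a minimal,
      hedged sketch of the merge. It mirrors only the description in this log,
      uses a stand-in struct, and is not the kernel's actual zext_32_to_64()
      implementation.
      
      #include <linux/types.h>
      
      /* Minimal stand-in for the bounds discussed above; the real
       * struct bpf_reg_state has many more members. */
      struct reg_bounds_sketch {
          __s64 smin_value, smax_value;
          __u64 umin_value, umax_value;
          __s32 s32_min_value, s32_max_value;
          __u32 u32_min_value, u32_max_value;
      };
      
      /* Hedged sketch of the "MOV/op ALU32" merge: the BPF ISA zero extends,
       * so the 32-bit bounds can be copied into the 64-bit bounds. */
      static void zext_32_to_64_sketch(struct reg_bounds_sketch *reg)
      {
          reg->umin_value = reg->u32_min_value;
          reg->umax_value = reg->u32_max_value;
          if (reg->s32_min_value >= 0) {
              reg->smin_value = reg->s32_min_value;
              reg->smax_value = reg->s32_max_value;
          } else {
              /* the signed 32-bit range crosses zero; only the unsigned
               * range is safe to reuse for the 64-bit signed bounds */
              reg->smin_value = 0;
              reg->smax_value = reg->u32_max_value;
          }
      }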
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
  2. 11 January 2020 (1 commit)
    • bpf: Introduce function-by-function verification · 51c39bb1
      Authored by Alexei Starovoitov
      New llvm, and old llvm with libbpf help, produce BTF that distinguishes
      global and static functions. Unlike the arguments of a static function, the
      arguments of global functions cannot be removed or optimized away by llvm.
      The compiler has to use exactly the arguments specified in the function
      prototype. The argument type information allows the verifier to validate each
      global function independently. For now the only supported argument types are
      pointer to context and scalars. In the future, pointers to structures, sizes,
      and pointers to packet data can be supported as well. Consider the following
      example:
      
      static int f1(int ...)
      {
        ...
      }
      
      int f3(int b);
      
      int f2(int a)
      {
        f1(a) + f3(a);
      }
      
      int f3(int b)
      {
        ...
      }
      
      int main(...)
      {
        f1(...) + f2(...) + f3(...);
      }
      
      The verifier will start its safety checks from the first global function,
      f2(). It will recursively descend into f1() because it's static. Then it will
      check that the arguments match for the f3() invocation inside f2(). It will
      not descend into f3(). It will finish f2(), which has to be successfully
      verified for all possible values of 'a'. Then it will proceed with f3(). That
      function also has to be safe for all possible values of 'b'. Then it will
      start subprog 0 (which is the main() function). It will recursively descend
      into f1() and will skip the full check of f2() and f3(), since they are
      global. The order of processing global functions doesn't affect safety, since
      all global functions must be proven safe based on their arguments only.
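      
      As a hedged illustration (function names, section name and the policy below
      are made up, not from this patch), a global function the verifier could check
      independently of its callers might look like this in BPF C; it must stay
      non-static and non-inlined so its exact prototype (ctx pointer plus scalar)
      survives in BTF:
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      /* Hypothetical global function: non-static and noinline so llvm keeps
       * the exact prototype for the verifier. It must be provably safe for
       * any value of 'limit'. */
      __attribute__((noinline))
      int check_len(struct __sk_buff *skb, int limit)
      {
          return skb->len > limit;
      }
      
      SEC("tc")
      int classifier(struct __sk_buff *skb)
      {
          return check_len(skb, 128);
      }
      
      char _license[] SEC("license") = "GPL";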
      
      Such function-by-function verification can drastically improve the speed of
      verification and reduce its complexity.
      
      Note that the stack limit of 512 bytes still applies to the call chain
      regardless of whether functions are static or global. The nesting limit of 8
      also still applies. The same recursion prevention checks are in place as well.
      
      The type information and static/global kind are preserved after verification,
      hence in the above example the global functions f2() and f3() can later be
      replaced by equivalent functions with the same types that are loaded and
      verified later, without affecting the safety of this main() program. Such
      replacement (re-linking) of global functions is a subject of future patches.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
  3. 25 November 2019 (1 commit)
    • bpf: Constant map key tracking for prog array pokes · d2e4c1e6
      Authored by Daniel Borkmann
      Add tracking of constant keys into tail call maps. The signature of
      bpf_tail_call_proto is that arg1 is ctx, arg2 is a map pointer and arg3
      is an index key. The direct call approach for tail calls can be enabled
      if the verifier asserted that for all branches leading to the tail call
      helper invocation, the map pointer and index key were both constant
      and the same.
      
      Tracking of map pointers is already done from prior work via c93552c4
      ("bpf: properly enforce index mask to prevent out-of-bounds speculation")
      and 09772d92 ("bpf: avoid retpoline for lookup/update/delete calls
      on maps").
      
      Given the tail call map index key is not on the stack but directly in the
      register, we can add a similar tracking approach and later, in
      fixup_bpf_calls(), add a poke descriptor to the prog's poke_tab with the
      relevant information for the JITing phase.
      
      We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
      instruction in order to point into the prog's poke_tab, and keep insn->imm
      as 0 as an indicator that the current indirect tail call emission must be
      used. Note that publishing to the tracker must happen at the end of
      fixup_bpf_calls() since adding elements to the poke_tab reallocates its
      memory, so we need to wait until it is in its final state.
      
      Future work can generalize this and add a similar approach to optimize plain
      array map lookups. The difference there is that we need to look into the key
      value that sits on the stack. For clarity, in bpf_insn_aux_data, map_state
      has been renamed to map_ptr_state, so we get map_{ptr,key}_state as
      trackers.
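      
      A hedged example of a program shape this tracking could turn into a direct
      jump: both the map pointer and the index are compile-time constants on every
      path to the helper call. Map and program names, and the BTF-defined map
      convention from libbpf, are assumptions, not part of this patch.
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      struct {
          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
          __uint(max_entries, 8);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
      } jmp_table SEC(".maps");
      
      SEC("tc")
      int entry_prog(struct __sk_buff *skb)
      {
          bpf_tail_call(skb, &jmp_table, 3);   /* constant map ptr, constant key */
          return 0;                            /* fallthrough if slot 3 is empty */
      }
      
      char _license[] SEC("license") = "GPL";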
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
  4. 16 November 2019 (1 commit)
  5. 17 October 2019 (2 commits)
    • bpf: Implement accurate raw_tp context access via BTF · 9e15db66
      Authored by Alexei Starovoitov
      libbpf analyzes the bpf C program, searches in-kernel BTF for the given type
      name and stores it into expected_attach_type.
      The kernel verifier expects this btf_id to point to something like:
      typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
      which represents the signature of the raw_tracepoint "kfree_skb".
      
      Then btf_ctx_access() matches the ctx+0 access in the bpf program with the
      'skb' argument and the ctx+8 access with the 'loc' argument of the
      "kfree_skb" tracepoint. In the first case it passes the btf_id of
      'struct sk_buff *' back to the verifier core, and 'void *' in the second case.
      
      Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
      Like PTR_TO_SOCKET points to 'struct bpf_sock',
      PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
      PTR_TO_BTF_ID points to in-kernel structs.
      If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF
      then PTR_TO_BTF_ID#1234 points to one of in kernel skbs.
      
      When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)(r1 + 32))
      btf_struct_access() checks which field of 'struct sk_buff' is
      at offset 32. It checks that the size of the access matches the type
      definition of the field and continues to track the dereferenced type.
      If that field was a pointer to 'struct net_device', r2's type
      will be PTR_TO_BTF_ID#456, where 456 is the btf_id of 'struct net_device'
      in vmlinux's BTF.
      
      Such verifier analysis prevents "cheating" in a BPF C program.
      The program cannot cast an arbitrary pointer to 'struct sk_buff *'
      and access it. The C compiler would allow the type cast, of course,
      but the verifier will notice the type mismatch based on the BPF assembly
      and in-kernel BTF.
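      
      A hedged sketch of a BTF-typed raw tracepoint program along these lines; the
      "tp_btf" section name and the vmlinux.h header are later libbpf conventions
      and are assumptions here, only the ctx-argument matching follows the
      description above.
      
      #include "vmlinux.h"              /* kernel types (incl. sk_buff) from BTF */
      #include <bpf/bpf_helpers.h>
      
      SEC("tp_btf/kfree_skb")
      int trace_kfree_skb(__u64 *ctx)
      {
          /* ctx+0 is matched to the 'skb' argument via BTF, so this pointer is
           * tracked as PTR_TO_BTF_ID of 'struct sk_buff' and the dereference
           * below is checked by btf_struct_access(). */
          struct sk_buff *skb = (struct sk_buff *)ctx[0];
      
          return skb->len > 0;
      }
      
      char _license[] SEC("license") = "GPL";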
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
    • bpf: Process in-kernel BTF · 8580ac94
      Authored by Alexei Starovoitov
      If in-kernel BTF exists, parse it and prepare 'struct btf *btf_vmlinux'
      for further use by the verifier.
      In-kernel BTF is trusted just like kallsyms and other build artifacts
      embedded into vmlinux.
      Still, run this BTF image through the BTF verifier to make sure
      that it is valid and wasn't mangled during the build.
      If the in-kernel BTF is incorrect, it means either gcc or pahole or the
      kernel is buggy. In such a case, disallow loading BPF programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
  6. 28 August 2019 (1 commit)
  7. 19 June 2019 (2 commits)
    • bpf: precise scalar_value tracking · b5dc0163
      Authored by Alexei Starovoitov
      Introduce precision tracking logic that
      helps cilium programs the most:
                        old clang  old clang    new clang  new clang
                                with all patches         with all patches
      bpf_lb-DLB_L3.o      1838     2283         1923       1863
      bpf_lb-DLB_L4.o      3218     2657         3077       2468
      bpf_lb-DUNKNOWN.o    1064     545          1062       544
      bpf_lxc-DDROP_ALL.o  26935    23045        166729     22629
      bpf_lxc-DUNKNOWN.o   34439    35240        174607     28805
      bpf_netdev.o         9721     8753         8407       6801
      bpf_overlay.o        6184     7901         5420       4754
      bpf_lxc_jit.o        39389    50925        39389      50925
      
      Consider code:
      654: (85) call bpf_get_hash_recalc#34
      655: (bf) r7 = r0
      656: (15) if r8 == 0x0 goto pc+29
      657: (bf) r2 = r10
      658: (07) r2 += -48
      659: (18) r1 = 0xffff8881e41e1b00
      661: (85) call bpf_map_lookup_elem#1
      662: (15) if r0 == 0x0 goto pc+23
      663: (69) r1 = *(u16 *)(r0 +0)
      664: (15) if r1 == 0x0 goto pc+21
      665: (bf) r8 = r7
      666: (57) r8 &= 65535
      667: (bf) r2 = r8
      668: (3f) r2 /= r1
      669: (2f) r2 *= r1
      670: (bf) r1 = r8
      671: (1f) r1 -= r2
      672: (57) r1 &= 255
      673: (25) if r1 > 0x1e goto pc+12
       R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
      674: (67) r1 <<= 1
      675: (0f) r0 += r1
      
      At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
      R1 has to be precise for later operations on R0 to be validated properly.
      
      The verifier will backtrack the above code in the following way:
      last_idx 675 first_idx 664
      regs=2 stack=0 before 675: (0f) r0 += r1         // started backtracking R1 regs=2 is a bitmask
      regs=2 stack=0 before 674: (67) r1 <<= 1
      regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
      regs=2 stack=0 before 672: (57) r1 &= 255
      regs=2 stack=0 before 671: (1f) r1 -= r2         // now both R1 and R2 have to be precise -> regs=6 mask
      regs=6 stack=0 before 670: (bf) r1 = r8          // after this insn R8 and R2 have to be precise
      regs=104 stack=0 before 669: (2f) r2 *= r1       // after this one R8, R2, and R1
      regs=106 stack=0 before 668: (3f) r2 /= r1
      regs=106 stack=0 before 667: (bf) r2 = r8
      regs=102 stack=0 before 666: (57) r8 &= 65535
      regs=102 stack=0 before 665: (bf) r8 = r7
      regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
       // this is the end of the verifier state. The following regs will be marked precise:
       R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
      parent didn't have regs=82 stack=0 marks         // so backtracking continues into parent state
      last_idx 663 first_idx 655
      regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0)   // R1 was assigned no need to track it further
      regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23    // keep tracking R7
      regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1  // keep tracking R7
      regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
      regs=80 stack=0 before 658: (07) r2 += -48
      regs=80 stack=0 before 657: (bf) r2 = r10
      regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
      regs=80 stack=0 before 655: (bf) r7 = r0                // here the assignment into R7
       // mark R0 to be precise:
       R0_rw=invP(id=0)
      parent didn't have regs=1 stack=0 marks                 // regs=1 -> tracking R0
      last_idx 654 first_idx 644
      regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
        // nothing further to backtrack
      
      Two scalar registers not marked precise are equivalent from the state
      pruning point of view. More details are in the patch comments.
      
      It doesn't support bpf2bpf calls yet and is enabled for root only.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: introduce bounded loops · 2589726d
      Authored by Alexei Starovoitov
      Allow the verifier to validate loops by simulating their execution.
      Existing programs have used '#pragma unroll' to have the compiler unroll
      the loops. Instead, let the verifier simulate all iterations
      of the loop.
      In order to do that, introduce a parentage chain of bpf_verifier_state and
      a 'branches' counter for the number of branches left to explore.
      See the more detailed algorithm description in bpf_verifier.h.
      
      This algorithm borrows the key idea from Edward Cree's approach:
      https://patchwork.ozlabs.org/patch/877222/
      Additional state pruning heuristics make such a brute-force loop walk
      practical even for large loops.
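      
      A hedged example (program and names are illustrative, not from the patch) of
      a loop the verifier can now accept by simulating each iteration, with no
      '#pragma unroll':
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      SEC("xdp")
      int bounded_loop_example(struct xdp_md *ctx)
      {
          int i, sum = 0;
      
          for (i = 0; i < 16; i++)     /* bounded: at most 16 iterations */
              sum += i;
      
          return (sum & 1) ? XDP_DROP : XDP_PASS;
      }
      
      char _license[] SEC("license") = "GPL";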
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  8. 31 May 2019 (1 commit)
  9. 25 May 2019 (1 commit)
    • bpf: verifier: mark verified-insn with sub-register zext flag · 5327ed3d
      Authored by Jiong Wang
      The eBPF ISA specification requires the high 32 bits to be cleared when the
      low 32-bit sub-register is written. This applies to the destination register
      of ALU32 ops etc. JIT back-ends must guarantee this semantic when doing
      code-gen. The x86_64 and AArch64 ISAs have the same semantics, so the
      corresponding JIT back-ends don't need to do extra work.
      
      However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
      (PowerPC, SPARC etc) need to do explicit zero extension to meet this
      requirement, otherwise code like the following will fail.
      
        u64_value = (u64) u32_value
        ... other uses of u64_value
      
      This is because the compiler can exploit the semantics described above and
      omit those zero extensions when extending u32_value to u64_value. These JIT
      back-ends are then expected to guarantee the semantics by inserting extra
      zero extensions, which however can be a significant increase in code size.
      Some benchmarks show there could be ~40% sub-register writes out of total
      insns, meaning at least ~40% extra code-gen.
      
      One observation is that these extra zero extensions are not always necessary.
      Take the above code snippet for example: it is possible u32_value will never
      be cast into a u64; the high 32 bits of u32_value could then be
      ignored and the extra zero extension eliminated.
      
      This patch implements this idea: insns defining sub-registers will be
      marked when the high 32 bits of the defined sub-register matter. For
      the unmarked insns, it is safe to eliminate the high 32-bit clearance
      for them.
      
      Algo:
       - Split read flags into READ32 and READ64.
      
       - Record index of insn that does sub-register write. Keep the index inside
         reg state and update it during verifier insn walking.
      
       - A full register read on a sub-register marks its definition insn as
         needing zero extension on dst register.
      
         A new sub-register write overrides the old one.
      
       - When propagating read64 during path pruning, also mark any insn defining
         a sub-register that is read in the pruned path as full-register.
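      
      An illustrative C snippet (not from the patch) of the idea above: whether
      the insn writing a sub-register needs an explicit zero extension depends on
      how that register is read later.
      
      #include <linux/types.h>
      
      static __u64 zext_need_example(__u64 x)
      {
          __u32 a = (__u32)x;   /* sub-register write                             */
          __u32 b = a + 7;      /* only a 32-bit read of 'a': zext can be elided  */
          __u64 c = (__u64)a;   /* full 64-bit read: the insn defining 'a' gets
                                 * marked as needing its high 32 bits cleared     */
      
          return b + c;
      }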
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. 24 May 2019 (2 commits)
    • bpf: convert explored_states to hash table · dc2a4ebc
      Authored by Alexei Starovoitov
      All prune points inside a callee bpf function will most likely have
      different callsites. For example, if function foo() is called from
      two callsites, half of the explored states in all prune points in foo()
      will be useless for subsequent walking of one of those callsites.
      Fortunately the explored_states pruning heuristics keep the number of states
      per prune point small, but walking these states is still a waste of cpu
      time when the callsite of the current state is different from the callsite
      of the explored state.
      
      To improve the pruning logic, convert explored_states into a hash table and
      use a simple insn_idx ^ callsite hash to select the hash bucket.
      This optimization has no effect on programs without bpf2bpf calls
      and drastically improves programs with calls.
      In the latter case it reduces total memory consumption in 1M scale tests
      by almost 3 times (peak_states drops from 5752 to 2016).
      
      Care should be taken when comparing the states for equivalency.
      Since the same hash bucket can now contain states with different indices,
      the insn_idx has to be part of verifier_state and compared.
      
      Different hash table sizes and different hash functions were explored,
      but the results were not significantly better vs this patch.
      They can be improved in the future.
      
      The hit/miss heuristic does not count an index mismatch as a miss.
      Otherwise verifier stats become unstable when experimenting
      with different hash functions.
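      
      A simplified sketch of the bucket selection described above; the real kernel
      helper and table size differ, only the insn_idx ^ callsite idea is shown.
      
      #include <linux/types.h>
      
      static __u32 explored_state_bucket(__u32 insn_idx, __u32 callsite,
                                         __u32 table_size)
      {
          return (insn_idx ^ callsite) % table_size;
      }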
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: split explored_states · a8f500af
      Authored by Alexei Starovoitov
      Split explored_states into a prune_point boolean mark
      and a linked list of explored states.
      This removes the STATE_LIST_MARK hack and allows marks to be separate from
      states.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 23 April 2019 (1 commit)
  12. 10 April 2019 (1 commit)
    • bpf: implement lookup-free direct value access for maps · d8eca5bb
      Authored by Daniel Borkmann
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for the ldimm64 instruction indicating
      that the first part of the double insns' imm field contains a
      file descriptor which the verifier then replaces with a full 64-bit
      address of the map in both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns' imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is an array map with a single entry.
      It is possible to support more than just a single map element by
      reusing both 16-bit off fields of the insns as a map index, so a
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of a concrete use case, but
      could easily be done in the future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      BPF_PSEUDO_MAP_VALUE is a distinct flag since otherwise, with
      BPF_PSEUDO_MAP_FD, we could not distinguish at offset 0 between a load of
      the map pointer versus a load of the map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differ between a
      regular map pointer and a map value pointer would add unnecessary
      complexity and increase the barrier for debuggability, thus making it less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since the maximum
      possible value size is in the u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
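      
      A hedged sketch of the two-insn ldimm64 encoding described above, as a loader
      might emit it; 'map_fd' and 'value_off' are placeholders, and this is not
      libbpf's actual code.
      
      #include <linux/bpf.h>
      
      static void emit_map_value_ld(struct bpf_insn insn[2], int map_fd,
                                    __u32 value_off)
      {
          insn[0] = (struct bpf_insn) {
              .code    = BPF_LD | BPF_DW | BPF_IMM,
              .dst_reg = BPF_REG_1,
              .src_reg = BPF_PSEUDO_MAP_VALUE,  /* new src_reg flag              */
              .imm     = map_fd,                /* first imm: map fd             */
          };
          insn[1] = (struct bpf_insn) {
              .imm     = (__s32)value_off,      /* second imm: offset into value */
          };
      }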
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  13. 04 April 2019 (2 commits)
    • bpf: improve verification speed by droping states · 9f4686c4
      Authored by Alexei Starovoitov
      Branch instructions, branch targets and calls in a bpf program are
      the places where the verifier remembers states that led to successful
      verification of the program.
      These states are used to prune brute force program analysis.
      For unprivileged programs there is a limit of 64 states per such
      'branching' instructions (maximum length is tracked by max_states_per_insn
      counter introduced in the previous patch).
      Simply reducing this threshold to 32 or lower increases insn_processed
      metric to the point that small valid programs get rejected.
      For root programs there is no limit and cilium programs can have
      max_states_per_insn to be 100 or higher.
      Walking 100+ states multiplied by number of 'branching' insns during
      verification consumes significant amount of cpu time.
      It turned out a simple LRU-like mechanism can be used to remove states
      that are unlikely to be helpful in future search pruning.
      This patch introduces hit_cnt and miss_cnt counters:
      hit_cnt - this many times this state successfully pruned the search
      miss_cnt - this many times this state was not equivalent to other states
      (and that other states were added to state list)
      
      The heuristic introduced in this patch is:
      if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
        /* drop this state from future considerations */
      
      Higher numbers increase max_states_per_insn (allow more states to be
      considered for pruning) and slow verification speed, but do not meaningfully
      reduce insn_processed metric.
      Lower numbers drop too many states and insn_processed increases too much.
      Many different formulas were considered.
      This one is simple and works well enough in practice.
      (the analysis was done on selftests/progs/* and on cilium programs)
      
      The end result is this heuristic improves verification speed by 10 times.
      Large synthetic programs that used to take a second more now take
      1/10 of a second.
      In cases where max_states_per_insn used to be 100 or more, now it's ~10.
      
      There is a slight increase in insn_processed for cilium progs:
                             before   after
      bpf_lb-DLB_L3.o 	1831	1838
      bpf_lb-DLB_L4.o 	3029	3218
      bpf_lb-DUNKNOWN.o 	1064	1064
      bpf_lxc-DDROP_ALL.o	26309	26935
      bpf_lxc-DUNKNOWN.o	33517	34439
      bpf_netdev.o		9713	9721
      bpf_overlay.o		6184	6184
      bpf_lxc_jit.o		37335	39389
      And 2-3 times improvement in the verification speed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add verifier stats and log_level bit 2 · 06ee7115
      Authored by Alexei Starovoitov
      In order to understand the verifier bottlenecks add various stats
      and extend log_level:
      log_level 1 and 2 are kept as-is:
      bit 0 - level=1 - print every insn and verifier state at branch points
      bit 1 - level=2 - print every insn and verifier state at every insn
      bit 2 - level=4 - print verifier error and stats at the end of verification
      
      When the verifier rejects a program, libbpf tries to load the program twice:
      once with log_level=0 (no messages, only the error code is reported to user
      space) and a second time with log_level=1 to tell the user why the verifier
      rejected it.
      
      With the introduction of bit 2 (level=4), libbpf can choose to always use
      that level and load programs once, since verification speed is not affected
      and in case of error the verbose message will be available.
      
      Note that the verifier stats are not part of uapi, just like all other
      verbose messages. They're expected to change in the future.
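      
      A hedged sketch of how a loader might request the stats-only output via
      log_level bit 2, using the raw bpf(2) syscall; other attrs and error handling
      are omitted, and the function name is made up.
      
      #include <linux/bpf.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      static __u64 ptr_to_u64(const void *p)
      {
          return (__u64)(unsigned long)p;
      }
      
      static int load_with_stats(const struct bpf_insn *insns, __u32 insn_cnt,
                                 char *log_buf, __u32 log_size)
      {
          union bpf_attr attr;
      
          memset(&attr, 0, sizeof(attr));
          attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
          attr.insns     = ptr_to_u64(insns);
          attr.insn_cnt  = insn_cnt;
          attr.license   = ptr_to_u64("GPL");
          attr.log_level = 4;                 /* bit 2: errors and stats only */
          attr.log_buf   = ptr_to_u64(log_buf);
          attr.log_size  = log_size;
      
          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
      }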
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  14. 14 March 2019 (1 commit)
    • bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Authored by Martin KaFai Lau
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is that the verifier did not scrub the register states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track whether a helper's return ptr (in
      REG_0) is "carrying" one of its argument's refcount status. To carry this
      info, the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which have a matching ref_obj_id.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to prevent these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcounted arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if more than one
        refcounted arg is found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by an acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  15. 02 February 2019 (1 commit)
    • bpf: introduce bpf_spin_lock · d83525ca
      Authored by Alexei Starovoitov
      Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
      bpf program serialize access to other variables.
      
      Example:
      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock;
      };
      struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
      if (val) {
          bpf_spin_lock(&val->lock);
          val->cnt++;
          bpf_spin_unlock(&val->lock);
      }
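      
      A hedged sketch of a map definition the example above could pair with, using
      libbpf's (later) BTF-defined map convention so BTF for the value type,
      including the bpf_spin_lock field, is available to the verifier; names and
      sizes are illustrative.
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, int);
          __type(value, struct hash_elem);
      } hash_map SEC(".maps");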
      
      Restrictions and safety checks:
      - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
      - BTF description of the map is mandatory for safety analysis.
      - bpf program can take one bpf_spin_lock at a time, since two or more can
        cause deadlocks.
      - only one 'struct bpf_spin_lock' is allowed per map element.
        It drastically simplifies implementation yet allows bpf program to use
        any number of bpf_spin_locks.
      - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
      - bpf program must bpf_spin_unlock() before return.
      - bpf program can access 'struct bpf_spin_lock' only via
        bpf_spin_lock()/bpf_spin_unlock() helpers.
      - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
      - to use bpf_spin_lock() helper the BTF description of map value must be
        a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
        Nested lock inside another struct is not allowed.
      - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
      - syscall map_update and program map_update do not update bpf_spin_lock field.
      - bpf_spin_lock cannot be on the stack or inside networking packet.
        bpf_spin_lock can only be inside HASH or ARRAY map value.
      - bpf_spin_lock is available to root only and to all program types.
      - bpf_spin_lock is not allowed in inner maps of map-in-map.
      - ld_abs is not allowed inside spin_lock-ed region.
      - tracing progs and socket filter progs cannot use bpf_spin_lock due to
        insufficient preemption checks
      
      Implementation details:
      - cgroup-bpf class of programs can nest with xdp/tc programs.
        Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
        Other solutions to avoid nested bpf_spin_lock are possible.
        Like making sure that all networking progs run with softirq disabled.
        spin_lock_irqsave is the simplest and doesn't add overhead to the
        programs that don't use it.
      - arch_spinlock_t is used when it's implemented as queued_spin_lock
      - archs can force their own arch_spinlock_t
      - on architectures where queued_spin_lock is not available and
        sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
      - presence of bpf_spin_lock inside map value could have been indicated via
        extra flag during map_create, but specifying it via BTF is cleaner.
        It provides introspection for map key/value and reduces user mistakes.
      
      Next steps:
      - allow bpf_spin_lock in other map types (like cgroup local storage)
      - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
        to request kernel to grab bpf_spin_lock before rewriting the value.
        That will serialize access to map elements.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  16. 24 January 2019 (2 commits)
  17. 06 January 2019 (1 commit)
  18. 03 January 2019 (2 commits)
    • bpf: prevent out of bounds speculation on pointer arithmetic · 979d63d5
      Authored by Daniel Borkmann
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move {prev_,}insn_idx into verifier env · c08435ec
      Authored by Daniel Borkmann
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  19. 15 December 2018 (2 commits)
    • bpf: add self-check logic to liveness analysis · 9242b5f5
      Authored by Alexei Starovoitov
      Introduce REG_LIVE_DONE to check the liveness propagation
      and prepare the states for merging.
      See algorithm description in clean_live_states().
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: verbose log bpf_line_info in verifier · d9762e84
      Authored by Martin KaFai Lau
      This patch adds bpf_line_info to the verifier's verbose log.
      It can give error context for debugging purposes.
      
      ~~~~~~~~~~
      Here is the verbose log for backedge:
      	while (a) {
      		a += bpf_get_smp_processor_id();
      		bpf_trace_printk(fmt, sizeof(fmt), a);
      	}
      
      ~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint
      13: while (a) {
      3: a += bpf_get_smp_processor_id();
      back-edge from insn 13 to 3
      
      ~~~~~~~~~~
      Here is the verbose log for invalid pkt access:
      Modification to test_xdp_noinline.c:
      
      	data = (void *)(long)xdp->data;
      	data_end = (void *)(long)xdp->data_end;
      /*
      	if (data + 4 > data_end)
      		return XDP_DROP;
      */
      	*(u32 *)data = dst->dst;
      
      ~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
      ; data = (void *)(long)xdp->data;
      224: (79) r2 = *(u64 *)(r10 -112)
      225: (61) r2 = *(u32 *)(r2 +0)
      ; *(u32 *)data = dst->dst;
      226: (63) *(u32 *)(r2 +0) = r1
      invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
      R2 offset is outside of the packet
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d9762e84
  20. 10 Dec 2018, 1 commit
    • M
      bpf: Add bpf_line_info support · c454a46b
      Authored by Martin KaFai Lau
      This patch adds bpf_line_info support.
      
      It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
      The "line_info", "line_info_cnt" and "line_info_rec_size" are added
      to the "union bpf_attr".  The "line_info_rec_size" makes
      bpf_line_info extensible in the future.
      
      The new "check_btf_line()" ensures the userspace line_info is valid
      for the kernel to use.
      
      When the verifier is translating/patching the bpf_prog (through
      "bpf_patch_insn_single()"), the line_infos' insn_off is also
      adjusted by the newly added "bpf_adj_linfo()".
      
      If the bpf_prog is jited, this patch also provides the jited addrs (in
      aux->jited_linfo) for the corresponding line_info.insn_off.
      "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
      It is currently called by the x86 jit.  Other jits can also use
      "bpf_prog_fill_jited_linfo()"; this will be done in follow-up patches.
      In the future, if deemed necessary, a particular jit could also provide
      its own "bpf_prog_fill_jited_linfo()" implementation.
      
      A few "*line_info*" fields are added to the bpf_prog_info such
      that the user can get the xlated line_info back (i.e. the line_info
      with its insn_off reflecting the translated prog).  The jited_line_info
      is available if the prog is jited.  It is an array of __u64.
      If the prog is not jited, jited_line_info_cnt is 0.
      
      The verifier's verbose log with line_info will be added in
      a follow-up patch.
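      As a rough illustration of how the three new attributes fit together at
      load time (a hypothetical loader snippet; everything beyond the attribute
      names described above is an assumption):
        /* Sketch only: pass an array of bpf_line_info records at BPF_PROG_LOAD.
         * The record size is sent explicitly so the struct can grow later. */
        #include <linux/bpf.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int load_with_line_info(union bpf_attr *attr,
                                       const struct bpf_line_info *linfo,
                                       __u32 cnt)
        {
                attr->line_info          = (__u64)(unsigned long)linfo;
                attr->line_info_cnt      = cnt;
                attr->line_info_rec_size = sizeof(*linfo);
                return syscall(__NR_bpf, BPF_PROG_LOAD, attr, sizeof(*attr));
        }
      The usual prog load fields (insns, license, prog_btf_fd, etc.) would of
      course still need to be filled in attr as before.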
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c454a46b
  21. 27 Nov 2018, 1 commit
    • Y
      bpf: btf: support proper non-jit func info · ba64e7d8
      Authored by Yonghong Song
      Commit 838e9690 ("bpf: Introduce bpf_func_info")
      added bpf func info support. With it, userspace is able
      to get better ksyms for jited bpf programs and to print
      out func prototypes.

      For a program containing func-to-func calls, the existing
      implementation returns the user-specified number of functions
      and their BTF types if the jit is enabled. If the jit is not
      enabled, it only returns the type for the main function.

      This is undesirable. The interpreter may still be used,
      and the feature should be identical regardless of
      whether the jit is enabled or not.
      This patch fixes the discrepancy.
      
      Fixes: 838e9690 ("bpf: Introduce bpf_func_info")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ba64e7d8
  22. 21 Nov 2018, 1 commit
    • Y
      bpf: Introduce bpf_func_info · 838e9690
      Authored by Yonghong Song
      This patch adds an interface to load a program with the following
      additional information:
         . prog_btf_fd
         . func_info, func_info_rec_size and func_info_cnt
      where func_info provides the function range and type_id
      corresponding to each function.
      
      The func_info_rec_size is introduced in the UAPI to specify the
      size of struct bpf_func_info passed from user space. This is
      intended to make the bpf_func_info structure growable in the future.
      If the kernel gets a different bpf_func_info size from userspace,
      it will try to handle the user request with the part of bpf_func_info
      it can understand. In this patch, the kernel understands
        struct bpf_func_info {
             __u32   insn_offset;
             __u32   type_id;
        };
      If the user passes a bpf_func_info record size of 16 bytes, the
      kernel can still handle the part of each record covered by the
      above definition.
      
      If the verifier agrees with the function ranges provided by the user,
      the bpf_prog ksym for each function will use the func name
      referenced by its type_id. This provides better naming since it
      is not subject to the 16-byte program name limitation, which is
      especially helpful for bpf programs that contain multiple
      subprograms.
      
      The bpf_prog_info interface is also extended to
      return btf_id, func_info, func_info_rec_size and func_info_cnt
      to userspace, so userspace can print out the function prototype
      for each xlated function. The insn_offset in the returned
      func_info corresponds to the insn offset for xlated functions.
      With the other jit-related fields in bpf_prog_info, userspace can
      also print out the function prototypes for each jited function.
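      Purely as a hedged illustration of the record layout (the insn offsets
      and type ids below are made up; they must refer to FUNC types in the
      BTF passed via prog_btf_fd):
        /* Sketch only: a main function at insn 0 plus one subprogram at insn 42. */
        #include <linux/types.h>

        struct bpf_func_info_v0 {          /* mirrors the definition above */
                __u32 insn_offset;         /* first insn of the function (xlated) */
                __u32 type_id;             /* BTF type id of the function */
        };

        static const struct bpf_func_info_v0 finfo[] = {
                { .insn_offset = 0,  .type_id = 5 },   /* main program */
                { .insn_offset = 42, .type_id = 7 },   /* subprogram   */
        };

        /* In union bpf_attr at load time:
         *   func_info          = pointer to finfo
         *   func_info_cnt      = 2
         *   func_info_rec_size = sizeof(struct bpf_func_info_v0)
         */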
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      838e9690
  23. 11 Nov 2018, 1 commit
  24. 01 Nov 2018, 1 commit
  25. 08 Oct 2018, 1 commit
  26. 03 Oct 2018, 3 commits
  27. 30 Aug 2018, 1 commit
    • E
      bpf/verifier: per-register parent pointers · 679c782d
      Authored by Edward Cree
      By giving each register its own liveness chain, we elide the skip_callee()
      logic.  Instead, each register's parent is the state it inherits from;
      both check_func_call() and prepare_func_exit() automatically connect
      reg states to the correct chain since when they copy the reg state across
      (r1-r5 into the callee as args, and r0 out as the return value) they also
      copy the parent pointer.
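      A hedged sketch of the mechanism (types and names are simplified
      stand-ins, not the verifier's actual ones):
        /* Sketch only: each register carries its own liveness parent, so a plain
         * struct copy when entering a callee links it to the right chain. */
        struct reg_state {
                int               type;
                struct reg_state *parent;   /* state this register inherits from */
        };

        struct func_frame {
                struct reg_state regs[11];  /* r0..r10 */
        };

        static void setup_callee_args(struct func_frame *caller,
                                      struct func_frame *callee)
        {
                /* r1-r5 into the callee as args; the parent pointer travels
                 * along, so no skip_callee()-style fixup is needed afterwards. */
                for (int i = 1; i <= 5; i++)
                        callee->regs[i] = caller->regs[i];
        }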
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      679c782d
  28. 24 May 2018, 1 commit
    • D
      bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Authored by Daniel Borkmann
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out-of-bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") is incorrect in what
      it enforces. In the 1st variant, the mask is taken from the map
      with the significantly larger number of entries, which would allow
      a certain degree of out-of-bounds speculation for the smaller map;
      in the 2nd variant, where the mask is taken from the map with the
      smaller number of entries, we get buggy behavior since we truncate
      the index of the larger map.
      
      The original intent from commit b2157399 is to reject cases
      where two or more different tail call maps are used in the
      same tail call helper invocation. However, the check on
      BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups, but in the case of tail calls we basically used the tail
      call map in insn_aux_data that was processed on the most recent
      path the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that, the
      poison check in fixup_bpf_calls() is hit for tail calls and the
      program is rejected. The latter only happens in the unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such a rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update, so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      it would turn the generic instruction sequence into a non-generic
      one. Thus patch_call_imm will simply update the insn->imm
      location, and bpf_map_lookup_elem() will later take care of the
      dispatch. Given we need this 'poison' state as a check, the
      information of whether a map is an unpriv_array gets lost, so
      enforcing it prior to that needs an additional state. In general
      this check is needed since there are complex and tail call
      intensive BPF programs out there where LLVM tends to generate
      such code occasionally. We therefore convert map_ptr into
      map_state to store all of this without extra memory overhead;
      the bit recording whether one of the maps involved in the
      collision was an unpriv_array is retained there as well.
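
      The shape of the fix, as a hedged sketch (names and the exact encoding
      are assumptions; the kernel uses its own poison value and helpers):
        /* Sketch only: one word in insn_aux_data records the map seen at this
         * call site plus its unpriv bit, or a poison value once two maps collide. */
        #include <stdbool.h>
        #include <stdint.h>

        #define MAP_PTR_UNPRIV  1UL
        #define MAP_PTR_POISON  ((uintptr_t)-1)

        struct insn_aux_data {
                uintptr_t map_state;   /* map ptr | unpriv bit, or poison */
        };

        static void record_map_ptr(struct insn_aux_data *aux,
                                   const void *map, bool unpriv)
        {
                uintptr_t state = (uintptr_t)map | (unpriv ? MAP_PTR_UNPRIV : 0);

                if (!aux->map_state)
                        aux->map_state = state;          /* first path to get here */
                else if (aux->map_state != state)
                        aux->map_state = MAP_PTR_POISON; /* another path, other map */
        }

        static bool map_ptr_poisoned(const struct insn_aux_data *aux)
        {
                /* fixup_bpf_calls() rejects the (unpriv) program when this is set */
                return aux->map_state == MAP_PTR_POISON;
        }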
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c93552c4
  29. 20 May 2018, 1 commit
    • A
      bpf: Prevent memory disambiguation attack · af86ca4e
      Authored by Alexei Starovoitov
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e
  30. 17 May 2018, 1 commit
  31. 04 May 2018, 1 commit