1. 19 December 2021, 2 commits
    • bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL · c25b2ae1
      Hao Luo committed
      We have introduced a new type to make bpf_reg composable, by
      allocating bits in the type to represent flags.
      
      One of the flags is PTR_MAYBE_NULL which indicates a pointer
      may be NULL. This patch switches the qualified reg_types to
      use this flag. The reg_types changed in this patch include:
      
      1. PTR_TO_MAP_VALUE_OR_NULL
      2. PTR_TO_SOCKET_OR_NULL
      3. PTR_TO_SOCK_COMMON_OR_NULL
      4. PTR_TO_TCP_SOCK_OR_NULL
      5. PTR_TO_BTF_ID_OR_NULL
      6. PTR_TO_MEM_OR_NULL
      7. PTR_TO_RDONLY_BUF_OR_NULL
      8. PTR_TO_RDWR_BUF_OR_NULL
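      
      As a rough illustration of what this buys the verifier, the enumeration-style
      NULL check can collapse into a single flag test. The snippet below is a hedged
      sketch based on the description above, not necessarily the exact kernel code:
      
        /* Before: every *_OR_NULL variant has to be enumerated. */
        static bool reg_type_may_be_null(enum bpf_reg_type type)
        {
                return type == PTR_TO_MAP_VALUE_OR_NULL ||
                       type == PTR_TO_SOCKET_OR_NULL ||
                       type == PTR_TO_SOCK_COMMON_OR_NULL ||
                       /* ... and so on for the other five types ... */
                       type == PTR_TO_RDWR_BUF_OR_NULL;
        }
      
        /* After: NULL-ness is a composable flag on the type. */
        static bool type_may_be_null(u32 type)
        {
                return type & PTR_MAYBE_NULL;
        }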
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com
    • bpf: Introduce composable reg, ret and arg types. · d639b9d1
      Hao Luo committed
      There are some common properties shared between bpf reg, ret and arg
      values. For instance, a value may be a NULL pointer, or a pointer to
      a read-only memory. Previously, to express these properties, enumeration
      was used. For example, in order to test whether a reg value can be NULL,
      reg_type_may_be_null() simply enumerates all types that are possibly
      NULL. The problem with this approach is that it's not scalable and causes
      a lot of duplication. These properties can be combined; for example, a
      type could be either MAYBE_NULL or RDONLY, or both.
      
      This patch series rewrites the layout of reg_type, arg_type and
      ret_type, so that common properties can be extracted and represented as
      composable flags. For example, one can write
      
       ARG_PTR_TO_MEM | PTR_MAYBE_NULL
      
      which is equivalent to the previous
      
       ARG_PTR_TO_MEM_OR_NULL
      
      The type ARG_PTR_TO_MEM is called a "base type" in this patch. Base
      types can be extended with flags. A flag occupies the higher bits while
      the base type sits in the lower bits.
      
      This patch in particular sets up a set of macros for this purpose. The
      following patches will rewrite arg_types, ret_types and reg_types
      respectively.
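      
      A minimal sketch of the layout described above, assuming 8 bits for the base
      type; the exact bit widths and macro names in the kernel may differ:
      
        #define BPF_BASE_TYPE_BITS      8
        #define BPF_BASE_TYPE_MASK      ((1 << BPF_BASE_TYPE_BITS) - 1)
      
        /* Flags occupy the bits above the base type. */
        #define PTR_MAYBE_NULL          (1 << BPF_BASE_TYPE_BITS)
      
        /* Extract the base type from a composed reg/arg/ret type. */
        #define base_type(t)            ((t) & BPF_BASE_TYPE_MASK)
      
        /* ARG_PTR_TO_MEM | PTR_MAYBE_NULL replaces ARG_PTR_TO_MEM_OR_NULL. */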
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-2-haoluo@google.com
  2. 17 December 2021, 2 commits
    • bpf: Right align verifier states in verifier logs. · 2e576648
      Christy Lee committed
      Make the verifier logs more readable by printing the verifier states
      on the corresponding instruction line. If the previous line was
      not a bpf instruction, then print the verifier states on their own
      line.
      
      Before:
      
      Validating test_pkt_access_subprog3() func#3...
      86: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R10=fp0
      ; int test_pkt_access_subprog3(int val, struct __sk_buff *skb)
      86: (bf) r6 = r2
      87: R2=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
      87: (bc) w7 = w1
      88: R1=invP(id=0) R7_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
      ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
      88: (bf) r1 = r6
      89: R1_w=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
      89: (85) call pc+9
      Func#4 is global and valid. Skipping.
      90: R0_w=invP(id=0)
      90: (bc) w8 = w0
      91: R0_w=invP(id=0) R8_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
      ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
      91: (b7) r1 = 123
      92: R1_w=invP123
      92: (85) call pc+65
      Func#5 is global and valid. Skipping.
      93: R0=invP(id=0)
      
      After:
      
      86: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R10=fp0
      ; int test_pkt_access_subprog3(int val, struct __sk_buff *skb)
      86: (bf) r6 = r2                      ; R2=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
      87: (bc) w7 = w1                      ; R1=invP(id=0) R7_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
      ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
      88: (bf) r1 = r6                      ; R1_w=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
      89: (85) call pc+9
      Func#4 is global and valid. Skipping.
      90: R0_w=invP(id=0)
      90: (bc) w8 = w0                      ; R0_w=invP(id=0) R8_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
      ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
      91: (b7) r1 = 123                     ; R1_w=invP123
      92: (85) call pc+65
      Func#5 is global and valid. Skipping.
      93: R0=invP(id=0)
      Signed-off-by: Christy Lee <christylee@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Only print scratched registers and stack slots to verifier logs. · 0f55f9ed
      Christy Lee committed
      When printing verifier state for any log level, print full verifier
      state only on function calls or on errors. Otherwise, only print the
      registers and stack slots that were accessed.
      
      Log size differences:
      
      verif_scale_loop6 before: 234566564
      verif_scale_loop6 after: 72143943
      69% size reduction
      
      kfree_skb before: 166406
      kfree_skb after: 55386
      69% size reduction
      
      Before:
      
      156: (61) r0 = *(u32 *)(r1 +0)
      157: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=ctx(id=0,off=0,imm=0) R2_w=invP0 R10=fp0 fp-8_w=00000000 fp-16_w=00\
      000000 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000 fp-56_w=00000000 fp-64_w=00000000 fp-72_w=00000000 fp-80_w=00000\
      000 fp-88_w=00000000 fp-96_w=00000000 fp-104_w=00000000 fp-112_w=00000000 fp-120_w=00000000 fp-128_w=00000000 fp-136_w=00000000 fp-144_w=00\
      000000 fp-152_w=00000000 fp-160_w=00000000 fp-168_w=00000000 fp-176_w=00000000 fp-184_w=00000000 fp-192_w=00000000 fp-200_w=00000000 fp-208\
      _w=00000000 fp-216_w=00000000 fp-224_w=00000000 fp-232_w=00000000 fp-240_w=00000000 fp-248_w=00000000 fp-256_w=00000000 fp-264_w=00000000 f\
      p-272_w=00000000 fp-280_w=00000000 fp-288_w=00000000 fp-296_w=00000000 fp-304_w=00000000 fp-312_w=00000000 fp-320_w=00000000 fp-328_w=00000\
      000 fp-336_w=00000000 fp-344_w=00000000 fp-352_w=00000000 fp-360_w=00000000 fp-368_w=00000000 fp-376_w=00000000 fp-384_w=00000000 fp-392_w=\
      00000000 fp-400_w=00000000 fp-408_w=00000000 fp-416_w=00000000 fp-424_w=00000000 fp-432_w=00000000 fp-440_w=00000000 fp-448_w=00000000
      ; return skb->len;
      157: (95) exit
      Func#4 is safe for any args that match its prototype
      Validating get_constant() func#5...
      158: R1=invP(id=0) R10=fp0
      ; int get_constant(long val)
      158: (bf) r0 = r1
      159: R0_w=invP(id=1) R1=invP(id=1) R10=fp0
      ; return val - 122;
      159: (04) w0 += -122
      160: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=invP(id=1) R10=fp0
      ; return val - 122;
      160: (95) exit
      Func#5 is safe for any args that match its prototype
      Validating get_skb_ifindex() func#6...
      161: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0
      ; int get_skb_ifindex(int val, struct __sk_buff *skb, int var)
      161: (bc) w0 = w3
      162: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0
      
      After:
      
      156: (61) r0 = *(u32 *)(r1 +0)
      157: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=ctx(id=0,off=0,imm=0)
      ; return skb->len;
      157: (95) exit
      Func#4 is safe for any args that match its prototype
      Validating get_constant() func#5...
      158: R1=invP(id=0) R10=fp0
      ; int get_constant(long val)
      158: (bf) r0 = r1
      159: R0_w=invP(id=1) R1=invP(id=1)
      ; return val - 122;
      159: (04) w0 += -122
      160: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
      ; return val - 122;
      160: (95) exit
      Func#5 is safe for any args that match its prototype
      Validating get_skb_ifindex() func#6...
      161: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0
      ; int get_skb_ifindex(int val, struct __sk_buff *skb, int var)
      161: (bc) w0 = w3
      162: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R3=invP(id=0)
      Signed-off-by: Christy Lee <christylee@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211216213358.3374427-2-christylee@fb.com
  3. 05 December 2021, 1 commit
  4. 06 October 2021, 1 commit
    • bpf: Introduce BPF support for kernel module function calls · 2357672c
      Kumar Kartikeya Dwivedi committed
      This change adds support on the kernel side to allow for BPF programs to
      call kernel module functions. Userspace will prepare an array of module
      BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
      In the kernel, the module BTFs are placed in the auxiliary struct for
      bpf_prog, and loaded as needed.
      
      The verifier then uses insn->off to index into the fd_array. insn->off
      0 is reserved for vmlinux BTF (for backwards compat), so userspace must
      use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
      an array sorted by offset, where each offset corresponds to one
      descriptor, with a limit of up to 256 such module BTFs.
      
      We also change the existing kfunc_tab to distinguish each element based on
      the (imm, off) pair, as each such call will now be distinct.
      
      Another change is to the check_kfunc_call callback, which now includes a
      struct module * pointer. This will be used in a later patch so that the
      kfunc_id and module pointer are matched for dynamically registered BTF
      sets from loadable modules, ensuring that the same kfunc_id in two modules
      doesn't lead to check_kfunc_call succeeding. For the duration of
      check_kfunc_call, the reference to struct module exists, as it returns
      the pointer stored in kfunc_btf_tab.
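      
      A hedged userspace sketch of how this is wired up at load time; the helper
      name below is made up, only the fd_array attribute and BPF_PROG_LOAD command
      come from the description above:
      
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/bpf.h>
      
        /* btf_fds[0] stays unused (insn->off == 0 is reserved for vmlinux BTF);
         * btf_fds[1..] hold module BTF fds referenced by kfunc call insns via
         * insn->off.
         */
        static int load_prog_with_module_btfs(union bpf_attr *attr, int *btf_fds)
        {
                attr->fd_array = (__u64)(unsigned long)btf_fds;
                return syscall(__NR_bpf, BPF_PROG_LOAD, attr, sizeof(*attr));
        }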
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
  5. 29 July 2021, 1 commit
    • bpf: Fix leakage due to insufficient speculative store bypass mitigation · 2039f26f
      Daniel Borkmann committed
      Spectre v4 gadgets make use of memory disambiguation, which is a set of
      techniques that execute memory access instructions, that is, loads and
      stores, out of program order; Intel's optimization manual, section 2.4.4.5:
      
        A load instruction micro-op may depend on a preceding store. Many
        microarchitectures block loads until all preceding store addresses are
        known. The memory disambiguator predicts which loads will not depend on
        any previous stores. When the disambiguator predicts that a load does
        not have such a dependency, the load takes its data from the L1 data
        cache. Eventually, the prediction is verified. If an actual conflict is
        detected, the load and all succeeding instructions are re-executed.
      
      af86ca4e ("bpf: Prevent memory disambiguation attack") tried to mitigate
      this attack by sanitizing the memory locations through preemptive "fast"
      (low latency) stores of zero prior to the actual "slow" (high latency) store
      of a pointer value such that upon dependency misprediction the CPU then
      speculatively executes the load of the pointer value and retrieves the zero
      value instead of the attacker controlled scalar value previously stored at
      that location, meaning, subsequent access in the speculative domain is then
      redirected to the "zero page".
      
      The sanitized preemptive store of zero prior to the actual "slow" store is
      done through a simple ST instruction based on r10 (frame pointer) with
      relative offset to the stack location that the verifier has been tracking
      on the original used register for STX, which does not have to be r10. Thus,
      there are no memory dependencies for this store, since it's only using r10
      and immediate constant of zero; hence af86ca4e /assumed/ a low latency
      operation.
      
      However, a recent attack demonstrated that this mitigation is not sufficient
      since the preemptive store of zero could also be turned into a "slow" store
      and is thus bypassed as well:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        31: (7b) *(u64 *)(r10 -16) = r2
        // r9 will remain "fast" register, r10 will become "slow" register below
        32: (bf) r9 = r10
        // JIT maps BPF reg to x86 reg:
        //  r9  -> r15 (callee saved)
        //  r10 -> rbp
        // train store forward prediction to break dependency link between both r9
        // and r10 by evicting them from the predictor's LRU table.
        33: (61) r0 = *(u32 *)(r7 +24576)
        34: (63) *(u32 *)(r7 +29696) = r0
        35: (61) r0 = *(u32 *)(r7 +24580)
        36: (63) *(u32 *)(r7 +29700) = r0
        37: (61) r0 = *(u32 *)(r7 +24584)
        38: (63) *(u32 *)(r7 +29704) = r0
        39: (61) r0 = *(u32 *)(r7 +24588)
        40: (63) *(u32 *)(r7 +29708) = r0
        [...]
        543: (61) r0 = *(u32 *)(r7 +25596)
        544: (63) *(u32 *)(r7 +30716) = r0
        // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
        // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
        // in hardware registers. rbp becomes slow due to push/pop latency. below is
        // disasm of bpf_ringbuf_output() helper for better visual context:
        //
        // ffffffff8117ee20: 41 54                 push   r12
        // ffffffff8117ee22: 55                    push   rbp
        // ffffffff8117ee23: 53                    push   rbx
        // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
        // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
        // [...]
        // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
        // ffffffff8117eee7: 5b                    pop    rbx
        // ffffffff8117eee8: 5d                    pop    rbp
        // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
        // ffffffff8117eeec: 41 5c                 pop    r12
        // ffffffff8117eeee: c3                    ret
        545: (18) r1 = map[id:4]
        547: (bf) r2 = r7
        548: (b7) r3 = 0
        549: (b7) r4 = 4
        550: (85) call bpf_ringbuf_output#194288
        // instruction 551 inserted by verifier    \
        551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
        // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
        552: (7b) *(u64 *)(r10 -16) = r7           /
        // following "fast" read to the same memory location, but due to dependency
        // misprediction it will speculatively execute before insn 551/552 completes.
        553: (79) r2 = *(u64 *)(r9 -16)
        // in speculative domain contains attacker controlled r2. in non-speculative
        // domain this contains r7, and thus accesses r7 +0 below.
        554: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
      
      As can be seen, the current speculative store bypass mitigation which the
      verifier inserts at line 551 is insufficient since /both/ the write of
      the zero sanitation as well as the map value pointer are high-latency
      instructions due to prior memory access via push/pop of r10 (rbp), in contrast
      to the low-latency read in line 553 via r9 (r15) which stays in hardware
      registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
      fp-16 can still be r2.
      
      Initial thoughts to address this issue were to track spilled pointer loads
      from the stack and enforce their load via LDX through r10 as well so that /both/
      the preemptive store of zero /as well as/ the load use the /same/ register
      such that a dependency is created between the store and load. However, this
      option is not sufficient either since it can be bypassed as well under
      speculation. An updated attack with pointer spill/fills now _all_ based on
      r10 would look as follows:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        [...]
        // longer store forward prediction training sequence than before.
        2062: (61) r0 = *(u32 *)(r7 +25588)
        2063: (63) *(u32 *)(r7 +30708) = r0
        2064: (61) r0 = *(u32 *)(r7 +25592)
        2065: (63) *(u32 *)(r7 +30712) = r0
        2066: (61) r0 = *(u32 *)(r7 +25596)
        2067: (63) *(u32 *)(r7 +30716) = r0
        // store the speculative load address (scalar) this time after the store
        // forward prediction training.
        2068: (7b) *(u64 *)(r10 -16) = r2
        // preoccupy the CPU store port by running sequence of dummy stores.
        2069: (63) *(u32 *)(r7 +29696) = r0
        2070: (63) *(u32 *)(r7 +29700) = r0
        2071: (63) *(u32 *)(r7 +29704) = r0
        2072: (63) *(u32 *)(r7 +29708) = r0
        2073: (63) *(u32 *)(r7 +29712) = r0
        2074: (63) *(u32 *)(r7 +29716) = r0
        2075: (63) *(u32 *)(r7 +29720) = r0
        2076: (63) *(u32 *)(r7 +29724) = r0
        2077: (63) *(u32 *)(r7 +29728) = r0
        2078: (63) *(u32 *)(r7 +29732) = r0
        2079: (63) *(u32 *)(r7 +29736) = r0
        2080: (63) *(u32 *)(r7 +29740) = r0
        2081: (63) *(u32 *)(r7 +29744) = r0
        2082: (63) *(u32 *)(r7 +29748) = r0
        2083: (63) *(u32 *)(r7 +29752) = r0
        2084: (63) *(u32 *)(r7 +29756) = r0
        2085: (63) *(u32 *)(r7 +29760) = r0
        2086: (63) *(u32 *)(r7 +29764) = r0
        2087: (63) *(u32 *)(r7 +29768) = r0
        2088: (63) *(u32 *)(r7 +29772) = r0
        2089: (63) *(u32 *)(r7 +29776) = r0
        2090: (63) *(u32 *)(r7 +29780) = r0
        2091: (63) *(u32 *)(r7 +29784) = r0
        2092: (63) *(u32 *)(r7 +29788) = r0
        2093: (63) *(u32 *)(r7 +29792) = r0
        2094: (63) *(u32 *)(r7 +29796) = r0
        2095: (63) *(u32 *)(r7 +29800) = r0
        2096: (63) *(u32 *)(r7 +29804) = r0
        2097: (63) *(u32 *)(r7 +29808) = r0
        2098: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; same as before, also including the
        // sanitation store with 0 from the current mitigation by the verifier.
        2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
        2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
        // load from stack intended to bypass stores.
        2101: (79) r2 = *(u64 *)(r10 -16)
        2102: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
        [...]
      
      Looking at the CPU microarchitecture, the scheduler might issue loads (such
      as seen in line 2101) before stores (line 2099,2100) because the load execution
      units become available while the store execution unit is still busy with the
      sequence of dummy stores (line 2069-2098). And so the load may use the prior
      stored scalar from r2 at address r10 -16 for speculation. The updated attack
      may work less reliably on CPU microarchitectures where loads and stores share
      execution resources.
      
      This concludes that the sanitizing with zero stores from af86ca4e ("bpf:
      Prevent memory disambiguation attack") is insufficient. Moreover, the detection
      of stack reuse from af86ca4e where previously data (STACK_MISC) has been
      written to a given stack slot where a pointer value is now to be stored does
      not have sufficient coverage as precondition for the mitigation either; for
      several reasons outlined as follows:
      
       1) Stack content from prior program runs could still be preserved and is
          therefore not "random", best example is to split a speculative store
          bypass attack between tail calls, program A would prepare and store the
          oob address at a given stack slot and then tail call into program B which
          does the "slow" store of a pointer to the stack with subsequent "fast"
          read. From program B PoV such stack slot type is STACK_INVALID, and
          therefore also must be subject to mitigation.
      
       2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
          condition, for example, the previous content of that memory location could
          also be a pointer to map or map value. Without the fix, a speculative
          store bypass is not mitigated in such precondition and can then lead to
          a type confusion in the speculative domain leaking kernel memory near
          these pointer types.
      
      While brainstorming on various alternative mitigation possibilities, we also
      stumbled upon a retrospective from Chrome developers [0]:
      
        [...] For variant 4, we implemented a mitigation to zero the unused memory
        of the heap prior to allocation, which cost about 1% when done concurrently
        and 4% for scavenging. Variant 4 defeats everything we could think of. We
        explored more mitigations for variant 4 but the threat proved to be more
        pervasive and dangerous than we anticipated. For example, stack slots used
        by the register allocator in the optimizing compiler could be subject to
        type confusion, leading to pointer crafting. Mitigating type confusion for
        stack slots alone would have required a complete redesign of the backend of
        the optimizing compiler, perhaps man years of work, without a guarantee of
        completeness. [...]
      
      From BPF side, the problem space is reduced, however, options are rather
      limited. One idea that has been explored was to xor-obfuscate pointer spills
      to the BPF stack:
      
        [...]
        // preoccupy the CPU store port by running sequence of dummy stores.
        [...]
        2106: (63) *(u32 *)(r7 +29796) = r0
        2107: (63) *(u32 *)(r7 +29800) = r0
        2108: (63) *(u32 *)(r7 +29804) = r0
        2109: (63) *(u32 *)(r7 +29808) = r0
        2110: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; xored with random 'secret' value
        // of 943576462 before store ...
        2111: (b4) w11 = 943576462
        2112: (af) r11 ^= r7
        2113: (7b) *(u64 *)(r10 -16) = r11
        2114: (79) r11 = *(u64 *)(r10 -16)
        2115: (b4) w2 = 943576462
        2116: (af) r2 ^= r11
        // ... and restored with the same 'secret' value with the help of AX reg.
        2117: (71) r3 = *(u8 *)(r2 +0)
        [...]
      
      While the above would not prevent speculation, it would make data leakage
      infeasible by directing it to random locations. In order to be effective
      and prevent type confusion under speculation, such a random secret would have
      to be regenerated for each store. The additional complexity involved for a
      tracking mechanism that prevents jumps such that restoring spilled pointers
      would not get corrupted is not worth the gain for unprivileged programs.
      Hence, the fix here eventually opted for emitting a non-public BPF_ST |
      BPF_NOSPEC instruction which the x86 JIT translates into an lfence opcode.
      Inserting the latter in between the store and load instruction is one of the
      mitigation options [1]. The x86 instruction manual notes:
      
        [...] An LFENCE that follows an instruction that stores to memory might
        complete before the data being stored have become globally visible. [...]
      
      This means that the preceding store instruction finished execution
      and the store is at minimum guaranteed to be in the CPU's store queue, but
      it's not guaranteed to be in that CPU's L1 cache at that point (globally
      visible); the latter would only be guaranteed via sfence. So the load, which
      is guaranteed to execute after the lfence for that local CPU, would have to
      rely on store-to-load forwarding. [2], in section 2.3 on store buffers, says:
      
        [...] For every store operation that is added to the ROB, an entry is
        allocated in the store buffer. This entry requires both the virtual and
        physical address of the target. Only if there is no free entry in the store
        buffer, the frontend stalls until there is an empty slot available in the
        store buffer again. Otherwise, the CPU can immediately continue adding
        subsequent instructions to the ROB and execute them out of order. On Intel
        CPUs, the store buffer has up to 56 entries. [...]
      
      One small upside of the fix is that it lifts constraints from af86ca4e
      where the sanitize_stack_off relative to r10 must be the same when coming
      from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
      or BPF_ST instruction. This happens either when we store a pointer or data
      value to the BPF stack for the first time, or upon later pointer spills.
      The former needs to be enforced since otherwise stale stack data could be
      leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
      BPF_NOSPEC mapping is currently optimized away, but others could emit a
      speculation barrier as well if necessary. For real-world unprivileged
      programs e.g. generated by LLVM, pointer spill/fill is only generated upon
      register pressure and LLVM only tries to do that for pointers which are not
      used often. The main impact on programs will be the initial BPF_ST | BPF_NOSPEC
      sanitation for the STACK_INVALID case when the first write to a stack slot
      occurs, e.g. upon map lookup. In the future we might refine ways to mitigate
      the latter cost.
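      
      For context, a hedged sketch of how an x86-64 JIT can lower the new
      BPF_ST | BPF_NOSPEC instruction; illustrative only, not necessarily the
      exact kernel code:
      
        /* Emit an lfence (0F AE E8) for BPF_ST | BPF_NOSPEC, placed between the
         * store and any subsequent load so the load cannot speculatively bypass
         * the store.
         */
        static void emit_nospec_barrier(unsigned char **pprog)
        {
                unsigned char *prog = *pprog;
      
                *prog++ = 0x0F;
                *prog++ = 0xAE;
                *prog++ = 0xE8;         /* lfence */
                *pprog = prog;
        }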
      
        [0] https://arxiv.org/pdf/1902.05178.pdf
        [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
        [2] https://arxiv.org/pdf/1905.05725.pdf
      
      Fixes: af86ca4e ("bpf: Prevent memory disambiguation attack")
      Fixes: f7cf25b2 ("bpf: track spill/fill of constants")
      Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  6. 16 July 2021, 4 commits
  7. 19 May 2021, 1 commit
  8. 11 May 2021, 1 commit
  9. 03 May 2021, 1 commit
    • bpf: Fix leakage of uninitialized bpf stack under speculation · 801c6058
      Daniel Borkmann committed
      The currently implemented mechanisms to mitigate data disclosure under
      speculation mainly address stack and map value oob access from the
      speculative domain. However, Piotr discovered that uninitialized BPF
      stack is not protected yet, and thus old data from the kernel stack,
      potentially including addresses of kernel structures, could still be
      extracted from that 512-byte window. The BPF stack is special
      compared to map values since it's not zero initialized for every
      program invocation, whereas map values /are/ zero initialized upon
      their initial allocation and thus cannot leak any prior data in either
      domain. In the non-speculative domain, the verifier ensures that every
      stack slot read must have a prior stack slot write by the BPF program
      to avoid such data leaking issue.
      
      However, this is not enough: for example, when the pointer arithmetic
      operation moves the stack pointer from the last valid stack offset to
      the first valid offset, the sanitation logic allows for any intermediate
      offsets during speculative execution, which could then be used to
      extract any restricted stack content via side-channel.
      
      Given that the use of unknown but bounded scalars is generally forbidden
      for unprivileged stack pointer arithmetic, we can simply turn the
      register-based arithmetic operation into an immediate-based arithmetic
      operation without the need for masking. This also gives the benefit
      of reducing the needed instructions for the operation. Given that, after
      the work in 7fedb63a ("bpf: Tighten speculative pointer arithmetic
      mask"), the aux->alu_limit already holds the final immediate value for
      the offset register with the known scalar, a simple mov of the
      immediate into the AX register, using AX as the source for the original
      instruction, is now sufficient in this case.
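      
      A hedged sketch of that patching idea (illustrative only; the real rewrite
      lives in the verifier's ALU sanitation path and differs in detail). It assumes
      kernel-internal helpers from <linux/filter.h> such as BPF_MOV32_IMM and the
      verifier's scratch register BPF_REG_AX:
      
        /* insn is the original register-based ALU op on a stack pointer;
         * alu_limit is the final immediate already computed by the verifier.
         */
        static int sanitize_to_imm(struct bpf_insn *insn, struct bpf_insn *patch,
                                   u32 alu_limit)
        {
                /* Load the known immediate into the scratch register ... */
                *patch++ = BPF_MOV32_IMM(BPF_REG_AX, alu_limit);
                /* ... and reuse the original ALU op with AX as its source. */
                insn->src_reg = BPF_REG_AX;
                *patch++ = *insn;
                return 2;       /* number of patched instructions */
        }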
      Reported-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  10. 14 April 2021, 1 commit
  11. 27 February 2021, 1 commit
    • bpf: Add bpf_for_each_map_elem() helper · 69c087ba
      Yonghong Song committed
      The bpf_for_each_map_elem() helper is introduced, which
      iterates over all map elements with a callback function. The
      helper signature looks like
        long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
      and for each map element, the callback_fn will be called. For example,
      for a hashmap, the callback signature may look like
        long callback_fn(map, key, val, callback_ctx)
      
      There are two known use cases for this. One is from upstream ([1]) where
      a for_each_map_elem helper may help implement a timeout mechanism
      in a more generic way. Another is from our internal discussion
      for a firewall use case where a map contains all the rules. The packet
      data can be compared against all these rules to decide whether to allow or
      deny the packet.
      
      For array maps, users can already use a bounded loop to traverse
      elements. Using this helper avoids the need for a bounded loop. For other
      types of maps (e.g., hash maps) where a bounded loop is hard or
      impossible to use, this helper provides a convenient way to
      operate on all elements.
      
      For callback_fn, besides the map and map element, a callback_ctx,
      allocated on the caller's stack, is also passed to the callback
      function. This callback_ctx argument can provide additional
      input and allows writing to the caller's stack for output.
      
      If the callback_fn returns 0, the helper will iterate to the next
      element, if available. If the callback_fn returns 1, the helper
      will stop iterating and return to the bpf program. Other return
      values are not used for now.
      
      Currently, this helper is only available with the JIT. It is possible
      to make it work with the interpreter with some effort, but I leave that
      as future work.
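      
      A hedged BPF C sketch of the usage described above; the map layout, section
      name and callback name are illustrative:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
      
        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 64);
                __type(key, __u32);
                __type(value, __u64);
        } rules SEC(".maps");
      
        struct cb_ctx {
                __u64 sum;
        };
      
        static long count_cb(struct bpf_map *map, __u32 *key, __u64 *val,
                             struct cb_ctx *ctx)
        {
                ctx->sum += *val;
                return 0;       /* 0 = continue, 1 = stop iterating */
        }
      
        SEC("tc")
        int sum_rules(struct __sk_buff *skb)
        {
                struct cb_ctx ctx = {};
      
                bpf_for_each_map_elem(&rules, count_cb, &ctx, 0);
                return ctx.sum ? 1 : 0;
        }
      
        char _license[] SEC("license") = "GPL";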
      
      [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210226204925.3884923-1-yhs@fb.com
  12. 13 February 2021, 1 commit
    • bpf: Support pointers in global func args · e5069b9c
      Dmitrii Banshchikov committed
      Add an ability to pass a pointer to a type with known size in arguments
      of a global function. Such pointers may be used to overcome the limit on
      the maximum number of arguments, avoid expensive and tricky workarounds
      and to have multiple output arguments.
      
      A referenced type may contain pointers but indirect access through them
      isn't supported.
      
      The implementation consists of two parts.  If a global function has an
      argument that is a pointer to a type with known size then:
      
        1) In btf_check_func_arg_match(): check that the corresponding
      register points to NULL or to a valid memory region that is large enough
      to contain the expected argument's type.
      
        2) In btf_prepare_func_args(): set the corresponding register type to
      PTR_TO_MEM_OR_NULL and its size to the size of the expected type.
      
      Only global functions are supported because allowing pointers for
      static functions might break validation. Consider the following
      scenario. A static function has a pointer argument. A caller passes a
      pointer to its stack memory. Because the callee can change the referenced
      memory, the verifier can no longer assume any particular slot type for the
      caller's stack memory, hence the slot type is changed to SLOT_MISC. If
      there is an operation that relies on a slot type other than SLOT_MISC then
      the verifier won't be able to infer safety of the operation.
      
      When the verifier sees a static function that has a pointer argument
      other than PTR_TO_CTX, it skips the argument check and continues
      with "inline" validation with more information available. The operation
      that relies on the particular slot type now succeeds.
      
      Because global functions were previously not allowed to have pointer
      arguments other than PTR_TO_CTX, it's not possible to break existing,
      valid code.
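      
      A hedged BPF C sketch of a global function taking a pointer to a
      fixed-size type, as described above; names and the attach point are
      illustrative:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
      
        struct stats {
                __u64 pkts;
                __u64 bytes;
        };
      
        /* Global (non-static) function: its pointer argument is validated
         * against a memory region of sizeof(struct stats), and may be NULL.
         */
        __noinline int account(struct stats *s, __u32 len)
        {
                if (!s)
                        return 0;
                s->pkts++;
                s->bytes += len;
                return 1;
        }
      
        SEC("tc")
        int prog(struct __sk_buff *skb)
        {
                struct stats s = {};
      
                return account(&s, skb->len);
        }
      
        char _license[] SEC("license") = "GPL";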
      Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210212205642.620788-4-me@ubique.spb.ru
  13. 11 February 2021, 1 commit
    • bpf: Allow variable-offset stack access · 01f810ac
      Andrei Matei committed
      Before this patch, variable offset access to the stack was disallowed
      for regular instructions, but was allowed for "indirect" accesses (i.e.
      helpers). This patch removes the restriction, allowing reading and
      writing to the stack through stack pointers with variable offsets. This
      makes stack-allocated buffers more usable in programs, and brings stack
      pointers closer to other types of pointers.
      
      The motivation is being able to use stack-allocated buffers for data
      manipulation. When the stack size limit is sufficient, allocating
      buffers on the stack is simpler than per-cpu arrays, or other
      alternatives.
      
      In unprivileged programs, variable-offset reads and writes are
      disallowed (they were already disallowed for the indirect access case)
      because the speculative execution checking code doesn't support them.
      Additionally, when writing through a variable-offset stack pointer, if
      any pointers are in the accessible range, there's a possibility of later
      leaking pointers because the write cannot be tracked precisely.
      
      Writes with variable offset mark the whole range as initialized, even
      though we don't know which stack slots are actually written. This is in
      order to not reject future reads to these slots. Note that this doesn't
      affect writes done through helpers; like before, helpers need the whole
      stack range to be initialized to begin with.
      All the stack slots in range are considered scalars after the write;
      variable-offset register spills are not tracked.
      
      For reads, all the stack slots in the variable range need to be
      initialized (but see above about what writes do), otherwise the read is
      rejected. All registers spilled in stack slots that might be read are
      marked as having been read; however, reads through such pointers don't do
      register filling; the target register will always be either a scalar or
      a constant zero.
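      
      A hedged BPF C sketch of the kind of access this enables; the program and
      names are illustrative, and it assumes a privileged loader since
      unprivileged variable-offset access stays disallowed:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
      
        SEC("tc")
        int pick_byte(struct __sk_buff *skb)
        {
                char buf[16] = {};              /* whole range initialized */
                __u32 idx = skb->len & 0xf;     /* unknown but bounded scalar */
      
                buf[0] = 1;
                /* Variable-offset read through a stack pointer: previously
                 * rejected for direct (non-helper) access, now allowed.
                 */
                return buf[idx];
        }
      
        char _license[] SEC("license") = "GPL";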
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
  14. 13 January 2021, 1 commit
    • bpf: Support BPF ksym variables in kernel modules · 541c3bad
      Andrii Nakryiko committed
      Add support for directly accessing kernel module variables from BPF programs
      using special ldimm64 instructions. This functionality builds upon vmlinux
      ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
      specifying kernel module BTF's FD in insn[1].imm field.
      
      During BPF program load time, the verifier will resolve the FD to a BTF
      object and will take a reference on the BTF object itself and, for module
      BTFs, on the corresponding module as well, to make sure it won't be unloaded
      from under a running BPF program. The
      mechanism used is similar to how bpf_prog keeps track of used bpf_maps.
      
      One interesting change is also in how a per-CPU variable is determined. The
      logic is to find the .data..percpu data section in the provided BTF, but
      vmlinux and each module have their own .data..percpu entries in BTF. So for
      the module case, the search for the DATASEC record needs to look only at the
      module's added BTF types. This is implemented with a custom search function.
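      
      A hedged BPF C sketch of what this enables from the program side; the
      module variable name below is made up, and libbpf is assumed to resolve the
      __ksym extern against the module's BTF and emit the ldimm64 with the module
      BTF fd in insn[1].imm:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
      
        /* Hypothetical variable exported via some kernel module's BTF. */
        extern const int my_module_counter __ksym;
      
        SEC("raw_tp/sys_enter")
        int read_module_ksym(const void *ctx)
        {
                /* Taking the address forces the ldimm64 ksym load; the verifier
                 * pins the module BTF (and module) while the prog is loaded.
                 */
                return (__u64)&my_module_counter != 0;
        }
      
        char _license[] SEC("license") = "GPL";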
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/bpf/20210112075520.4103414-6-andrii@kernel.org
  15. 04 December 2020, 1 commit
    • bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier · 22dc4a0f
      Andrii Nakryiko committed
      Remove a permeating assumption throughout the BPF verifier of vmlinux BTF.
      Instead, wherever BTF type IDs are involved, also track the instance of struct
      btf that goes along with the type ID. This allows gradually adding support for
      kernel module BTFs and using/tracking module types across BPF helper calls and
      registers.
      
      This patch also renames the btf_id() function to btf_obj_id() to minimize the
      naming clash with using btf_id to denote a BTF *type* ID rather than a BTF
      *object* ID.
      
      Also, although btf_vmlinux can't get destructed and thus doesn't need
      refcounting, module BTFs do need it, so apply BTF refcounting universally when
      a BPF program is using a BTF-powered attachment (tp_btf, fentry/fexit, etc).
      This makes for simpler clean-up code.
      
      Now that a BTF type ID is not enough to uniquely identify a BTF type, extend
      the BPF trampoline key to include the BTF object ID. To differentiate that
      from a target program's BPF ID, set the 31st bit of the type ID. BTF type IDs
      (at least currently) are not allowed to take the full 32 bits, so there is no
      danger of confusing that bit with a valid BTF type ID.
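      
      A sketch of the extended key, following only the description above (the 31st
      bit marks a BTF type ID as opposed to a target prog ID); treat the exact bit
      layout and the function name as illustrative rather than the literal kernel code:
      
        #include <linux/types.h>
      
        static inline __u64 trampoline_compute_key(__u32 btf_obj_id, __u32 btf_type_id)
        {
                /* upper 32 bits: BTF object ID; bit 31: "this is a BTF type ID" */
                return ((__u64)btf_obj_id << 32) | (1U << 31) | btf_type_id;
        }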
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
  16. 13 November 2020, 1 commit
  17. 03 October 2020, 1 commit
    • bpf: Introduce pseudo_btf_id · 4976b718
      Hao Luo committed
      Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
      ksym so that further dereferences on the ksym can use the BTF info
      to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
      the verifier reads the btf_id stored in the insn[0]'s imm field and
      marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
      which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
      type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
      and the mem_size is resolved to the size of the VAR's type.
      
      From the VAR btf_id, the verifier can also read the address of the
      ksym's corresponding kernel var from kallsyms and use that to fill
      dst_reg.
      
      Therefore, the proper functionality of pseudo_btf_id depends on (1)
      kallsyms and (2) the encoding of kernel global VARs in pahole, which
      should be available since pahole v1.18.
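      
      For reference, a hedged sketch of what such an ld_imm64 looks like when
      constructed by hand (the BTF id value is a placeholder; loaders such as
      libbpf normally fill this in):
      
        #include <linux/bpf.h>
      
        /* ld_imm64 is a two-slot (16-byte) instruction; src_reg=BPF_PSEUDO_BTF_ID
         * tells the verifier that insn[0].imm holds the BTF id of a VAR kind.
         */
        struct bpf_insn ksym_ld[2] = {
                { .code = BPF_LD | BPF_DW | BPF_IMM, .dst_reg = BPF_REG_1,
                  .src_reg = BPF_PSEUDO_BTF_ID, .imm = 42 /* placeholder btf_id */ },
                { .code = 0, .imm = 0 },        /* second half of the insn */
        };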
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com
  18. 29 September 2020, 2 commits
  19. 18 September 2020, 3 commits
    • bpf: Add abnormal return checks. · 09b28d76
      Alexei Starovoitov committed
      LD_[ABS|IND] instructions may return from the function early. The bpf_tail_call
      pseudo instruction either falls through or returns. Allow them in
      subprograms only when the subprograms are BTF annotated and have scalar return
      types. Allow ld_abs and tail_call in the main program even if it calls into
      subprograms. In the past that was not ok to do for ld_abs, since it was JITed
      with a special exit sequence. Since bpf_gen_ld_abs() was introduced, ld_abs
      looks like a normal exit insn from the JIT's point of view, so it's safe to
      allow them in the main program.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Maciej Fijalkowski committed
      This commit serves two things:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the layout of the stack has changed, tail call counter handling can no
      longer rely on popping it to rbx just like it has been handled for the
      constant prologue case and the later overwrite of rbx with the actual value
      of rbx pushed to the stack. Therefore, let's use one of the registers (%rcx)
      that is considered to be volatile/caller-saved and pop the value of the tail
      call counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
      it is dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      For the case when the target program is not present, the BPF program will
      skip the pop instructions and the nop5 dedicated for jmpq $target. An example
      of such a state when only R6 of the callee-saved registers is used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When the target program is inserted, the jump that was there to skip the
      pops/nop5 will become the nop5, so the CPU will go over the pops and do the
      actual tailcall.
      
      One might ask why there simply cannot be pushes after the nop5.
      Consider the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
      and the jump target is not present. ctx is in the %rbx register and the BPF
      subprogram that we will call into at ffffffffc037034c relies on it,
      e.g. it will pick up ctx from there. Such a code layout is therefore broken,
      as we would overwrite the content of %rbx with the value that was pushed
      in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of the
      tailcall target. In the case when the target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for the RCU
         grace period so that other programs will finish their execution and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to encounter an endless loop due to clearing the
      tailcall counter if, for example, we used a tailcall3-like program from the
      BPF selftests that is subprogram-based, meaning the tailcall would be
      present within a BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for a tailcall presence throughout the whole chain. If we see a single
      tail call, then each node in this call chain needs to be marked as a subprog
      that can reach the tailcall. We later feed the JIT with this info
      and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Maciej Fijalkowski committed
      Protect against potential stack overflow that might happen when bpf2bpf
      calls get combined with tailcalls. Limit the caller's stack depth for
      such a case to 256 so that the worst-case scenario would result in an 8k
      stack size (32, which is the tailcall limit, * 256 = 8k).
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  20. 23 June 2020, 1 commit
    • bpf: Support access to bpf map fields · 41c48f3a
      Andrey Ignatov committed
      There are multiple use-cases when it's convenient to have access to bpf
      map fields, both `struct bpf_map` and map type specific struct-s such as
      `struct bpf_array`, `struct bpf_htab`, etc.
      
      For example while working with sock arrays it can be necessary to
      calculate the key based on map->max_entries (some_hash % max_entries).
      Currently this is solved by communicating max_entries via an "out-of-band"
      channel, e.g. via an additional map with a known key to get info about the
      target map. That works, but is not very convenient and is error-prone while
      working with many maps.
      
      In other cases necessary data is dynamic (i.e. unknown at loading time)
      and it's impossible to get it at all. For example while working with a
      hash table it can be convenient to know how much capacity is already
      used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).
      
      At the same time kernel knows this info and can provide it to bpf
      program.
      
      Fill this gap by adding support to access bpf map fields from bpf
      program for both `struct bpf_map` and map type specific fields.
      
      Support is implemented via btf_struct_access() so that a user can define
      their own `struct bpf_map` or map type specific struct in their program
      with only necessary fields and preserve_access_index attribute, cast a
      map to this struct and use a field.
      
      For example:
      
      	struct bpf_map {
      		__u32 max_entries;
      	} __attribute__((preserve_access_index));
      
      	struct bpf_array {
      		struct bpf_map map;
      		__u32 elem_size;
      	} __attribute__((preserve_access_index));
      
      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 4);
      		__type(key, __u32);
      		__type(value, __u32);
      	} m_array SEC(".maps");
      
      	SEC("cgroup_skb/egress")
      	int cg_skb(void *ctx)
      	{
      		struct bpf_array *array = (struct bpf_array *)&m_array;
      		struct bpf_map *map = (struct bpf_map *)&m_array;
      
      		/* .. use map->max_entries or array->map.max_entries .. */
      	}
      
      Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
      in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
      corresponding struct. Only reading from map fields is supported.
      
      For btf_struct_access() to work there should be a way to know btf id of
      a struct that corresponds to a map type. To get btf id there should be a
      way to get a stringified name of map-specific struct, such as
      "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
      `struct bpf_map_ops` to handle it:
      * .map_btf_name keeps a btf name of a struct returned by map_alloc();
      * .map_btf_id is used to cache btf id of that struct.
      
      To make btf id calculation cheaper, they're calculated once while
      preparing btf_vmlinux and cached the same way as is done for the btf_id
      field of `struct bpf_func_proto`.
      
      While calculating btf ids, struct names are NOT checked for collision.
      Collisions will be checked as a part of the work to prepare btf ids used
      in verifier in compile time that should land soon. The only known
      collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs
      net/core/sock_map.c) was fixed earlier.
      
      Both new fields .map_btf_name and .map_btf_id must be set for a map type
      for the feature to work. If neither is set for a map type, verifier will
      return ENOTSUPP on an attempt to access map_ptr of the corresponding type. If
      just one of them is set, it's a verifier misconfiguration.
      
      Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
      BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
      supported separately.
      
      The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
      perfmon_capable() so that unpriv programs won't have access to bpf map
      fields.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
  21. 02 June 2020, 1 commit
    • bpf: Implement BPF ring buffer and verifier support for it · 457f4436
      Andrii Nakryiko committed
      This commit adds a new MPSC ring buffer implementation into the BPF ecosystem,
      which allows multiple CPUs to submit data to a single shared ring buffer. On
      the consumption side, only a single consumer is assumed.
      
      Motivation
      ----------
      There are two distinctive motivators for this work, which are not satisfied by
      the existing perf buffer and which prompted the creation of a new ring buffer
      implementation:
        - more efficient memory utilization by sharing ring buffer across CPUs;
        - preserving ordering of events that happen sequentially in time, even
        across multiple CPUs (e.g., fork/exec/exit events for a task).
      
      These two problems are independent, but perf buffer fails to satisfy both.
      Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
      also solved by having an MPSC implementation of ring buffer. The ordering
      problem could technically be solved for perf buffer with some in-kernel
      counting, but given the first one requires an MPSC buffer, the same solution
      would solve the second problem automatically.
      
      Semantics and APIs
      ------------------
      A single ring buffer is presented to BPF programs as an instance of a BPF map
      of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but
      ultimately rejected.
      
      One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, have
      BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but not enforce the
      "same CPU only" rule. This would be a more familiar interface, compatible with
      existing perf buffer use in BPF, but would fail if the application needed more
      advanced logic to look up a ring buffer by an arbitrary key. HASH_OF_MAPS
      addresses this with the current approach. Additionally, given the performance
      of BPF ringbuf, many use cases would just opt into a simple single ring buffer
      shared among all CPUs, for which the current approach would be overkill.
      
      Another approach could introduce a new concept, alongside the BPF map, to
      represent a generic "container" object, which doesn't necessarily have a
      key/value interface with lookup/update/delete operations. This approach would add a lot
      of extra infrastructure that has to be built for observability and verifier
      support. It would also add another concept that BPF developers would have to
      familiarize themselves with, new syntax in libbpf, etc. But then would really
      provide no additional benefits over the approach of using a map.
      BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but so
      doesn't few other map types (e.g., queue and stack; array doesn't support
      delete, etc).
      
      The chosen approach has the advantage of re-using existing BPF map
      infrastructure (introspection APIs in the kernel, libbpf support, etc.),
      being a familiar concept (no need to teach users a new type of object in
      a BPF program), and utilizing existing tooling (bpftool). For the common
      scenario of using a single ring buffer for all CPUs, it's as simple and
      straightforward as it would be with a dedicated "container" object. On
      the other hand, by being
      a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
      implement a wide variety of topologies, from one ring buffer for each CPU
      (e.g., as a replacement for perf buffer use cases), to a complicated
      application hashing/sharding of ring buffers (e.g., having a small pool of
      ring buffers with hashed task's tgid being a look up key to preserve order,
      but reduce contention).
      
      Key and value sizes are enforced to be zero. max_entries is used to
      specify the size of the ring buffer and has to be a power-of-2 value.
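
      For illustration, using libbpf's BTF-defined map syntax (a sketch, not
      part of this commit), such a map could be declared in a BPF program as:

        /* Sketch: key/value sizes stay zero; max_entries is the size of the
         * data area and must be a power of 2. */
        struct {
                __uint(type, BPF_MAP_TYPE_RINGBUF);
                __uint(max_entries, 256 * 1024);   /* 256 KB data area */
        } rb SEC(".maps");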
      
      There are a bunch of similarities between perf buffer
      (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
        - variable-length records;
        - if there is no more space left in ring buffer, reservation fails, no
          blocking;
        - memory-mappable data area for user-space applications for ease of
          consumption and high performance;
        - epoll notifications for new incoming data;
        - but still the ability to do busy polling for new data to achieve the
          lowest latency, if necessary.
      
      BPF ringbuf provides two sets of APIs to BPF programs:
        - bpf_ringbuf_output() allows BPF programs to *copy* data from one place
          to a ring buffer, similarly to bpf_perf_event_output();
        - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
          split the whole process into two steps. First, a fixed amount of space
          is reserved. If successful, a pointer to data inside the ring buffer
          data area is returned, which BPF programs can use similarly to data
          inside array/hash maps. Once ready, this piece of memory is either
          committed or discarded. Discard is similar to commit, but makes the
          consumer ignore the record.
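
      A hedged sketch of the reserve/submit flow from a BPF program, reusing
      the hypothetical 'rb' map declared above and a made-up 'struct event'
      payload (the program-facing commit helper is bpf_ringbuf_submit()):

        struct event { __u32 pid; };             /* hypothetical payload */

        SEC("tracepoint/sched/sched_process_exec")
        int handle_exec(void *ctx)
        {
                struct event *e;

                e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
                if (!e)
                        return 0;                /* no space left, no blocking */
                e->pid = bpf_get_current_pid_tgid() >> 32;
                if (!e->pid)
                        bpf_ringbuf_discard(e, 0);  /* consumer will skip it */
                else
                        bpf_ringbuf_submit(e, 0);   /* record becomes visible */
                return 0;
        }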
      
      bpf_ringbuf_output() has the disadvantage of incurring an extra memory
      copy, because the record has to be prepared in some other place first.
      But it allows submitting records whose length is not known to the
      verifier beforehand. It also closely matches bpf_perf_event_output(), so
      it will simplify migration significantly.

      bpf_ringbuf_reserve() avoids the extra copy by providing a memory pointer
      directly into ring buffer memory. In a lot of cases records are larger
      than BPF stack space allows, so many programs have to use an extra
      per-CPU array as a temporary heap for preparing a sample.
      bpf_ringbuf_reserve() avoids this need completely. But in exchange, it
      only allows a known, constant size of memory to be reserved, such that
      the verifier can verify that the BPF program can't access memory outside
      its reserved record space. bpf_ringbuf_output(), while slightly slower
      due to the extra memory copy, covers some use cases that are not suitable
      for bpf_ringbuf_reserve().
      
      The difference between commit and discard is very small. Discard just
      marks a record as discarded, and such records are supposed to be ignored
      by the consumer code. Discard is useful for some advanced use cases, such
      as ensuring all-or-nothing multi-record submission, or emulating a
      temporary malloc()/free() within a single BPF program invocation.
      
      Each reserved record is tracked by the verifier through the existing
      reference-tracking logic, similar to socket ref-tracking. It is thus
      impossible to reserve a record but forget to submit (or discard) it.
      
      The bpf_ringbuf_query() helper allows querying various properties of the
      ring buffer. Currently 4 are supported:
        - BPF_RB_AVAIL_DATA returns the amount of unconsumed data in the ring
          buffer;
        - BPF_RB_RING_SIZE returns the size of the ring buffer;
        - BPF_RB_CONS_POS/BPF_RB_PROD_POS return the current logical position
          of the consumer/producer, respectively.
      The returned values are momentary snapshots of the ring buffer state and
      could be stale by the time the helper returns, so this should be used
      only for debugging/reporting reasons or for implementing various
      heuristics that take into account the highly changeable nature of some of
      those characteristics.
      
      One such heuristic might involve more fine-grained control over
      poll/epoll notifications about new data availability in the ring buffer.
      Together with the BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for the
      output/commit/discard helpers, it gives a BPF program a high degree of
      control and, e.g., more efficient batched notifications. The default
      self-balancing strategy, though, should be adequate for most applications
      and already works reliably and efficiently.
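
      One possible shape of such a heuristic (a sketch; the 64 KB threshold,
      the 'rb' map and the reserved record 'e' are assumptions made for this
      example):

        /* Sketch: wake the consumer only once enough data has accumulated. */
        __u64 avail = bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA);
        __u64 flags = avail >= 64 * 1024 ? BPF_RB_FORCE_WAKEUP
                                         : BPF_RB_NO_WAKEUP;

        bpf_ringbuf_submit(e, flags);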
      
      Design and implementation
      -------------------------
      This reserve/commit scheme allows a natural way for multiple producers,
      either on different CPUs or even on the same CPU/in the same BPF program,
      to reserve independent records and work with them without blocking other
      producers. This means that if a BPF program was interrupted by another
      BPF program sharing the same ring buffer, they will both get a record
      reserved (provided there is enough space left) and can work with it and
      submit it independently. This applies to NMI context as well, except
      that, due to the use of a spinlock during reservation, in NMI context
      bpf_ringbuf_reserve() might fail to get the lock, in which case the
      reservation will fail even if the ring buffer is not full.
      
      Internally, the ring buffer itself is implemented as a power-of-2 sized
      circular buffer, with two logical and ever-increasing counters (which
      might wrap around on 32-bit architectures; that's not a problem):
        - the consumer counter shows up to which logical position the consumer
          has consumed the data;
        - the producer counter denotes the amount of data reserved by all
          producers.
      
      Each time a record is reserved, the producer that "owns" the record
      successfully advances the producer counter. At that point, the data is
      still not yet ready to be consumed, though. Each record has an 8-byte
      header, which contains the length of the reserved record, as well as two
      extra bits: a busy bit to denote that the record is still being worked
      on, and a discard bit, which might be set at commit time if the record is
      discarded. In the latter case, the consumer is supposed to skip the
      record and move on to the next one. The record header also encodes the
      record's relative offset from the beginning of the ring buffer data area
      (in pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to
      accept only the pointer to the record itself, without also requiring a
      pointer to the ring buffer. The ring buffer memory location is restored
      from the record metadata header. This significantly simplifies the
      verifier and improves API usability.
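
      A rough sketch of such a record header, following the description above
      (the authoritative layout is in kernel/bpf/ringbuf.c and may differ):

        /* Hedged sketch of the 8-byte per-record metadata. */
        #define RINGBUF_BUSY_BIT    (1U << 31)   /* record still being written */
        #define RINGBUF_DISCARD_BIT (1U << 30)   /* consumer should skip it */

        struct ringbuf_hdr {
                __u32 len;     /* record length plus the two bits above */
                __u32 pg_off;  /* offset back to the ring buffer pages, so
                                * commit/discard need only the record pointer */
        };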
      
      Producer counter increments are serialized under a spinlock, so there is
      a strict ordering between reservations. Commits, on the other hand, are
      completely lockless and independent. All records become available to the
      consumer in the order of reservations, but only after all previous
      records were already committed. It is thus possible for slow producers to
      temporarily hold back records that were reserved later but already
      submitted.
      
      Reservation/commit/consumer protocol is verified by litmus tests in
      Documentation/litmus-test/bpf-rb.
      
      One interesting implementation bit that significantly simplifies (and
      thus also speeds up) the implementation of both producers and consumers
      is how the data area is mapped twice contiguously back-to-back in virtual
      memory. This removes the need for any special measures for samples that
      have to wrap around at the end of the circular buffer data area, because
      the next page after the last data page is the first data page again, and
      thus the sample still appears completely contiguous in virtual memory.
      See the comment and a simple ASCII diagram showing this visually in
      bpf_ringbuf_area_alloc().
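
      The idea can be sketched as follows (simplified; the real allocator in
      bpf_ringbuf_area_alloc() also maps metadata pages and handles errors):

        /* Sketch: list every data page twice and vmap() the array, so a record
         * that wraps past the end is still virtually contiguous. */
        struct page **pages;          /* nr_meta + 2 * nr_data entries */
        int nr_meta = 2, nr_data = 64; /* example page counts */
        void *addr;
        int i;

        for (i = 0; i < nr_data; i++) {
                struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

                pages[nr_meta + i] = page;            /* first mapping  */
                pages[nr_meta + nr_data + i] = page;  /* second mapping */
        }
        addr = vmap(pages, nr_meta + 2 * nr_data, VM_MAP, PAGE_KERNEL);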
      
      Another feature that distinguishes BPF ringbuf from the perf ring buffer
      is its self-pacing notification of new data availability.
      The bpf_ringbuf_commit() implementation sends a notification of a new
      record being available after commit only if the consumer has already
      caught up right up to the record being committed. If not, the consumer
      still has to catch up and thus will see the new data anyway, without
      needing an extra poll notification.
      Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show
      that this allows achieving very high throughput without having to resort
      to tricks like "notify only every Nth sample", which are necessary with
      the perf buffer. For extreme cases, when a BPF program wants more manual
      control of notifications, the commit/discard/output helpers accept the
      BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full control
      over notifications of data availability, but require extra caution and
      diligence in using this API.
      
      Comparison to alternatives
      --------------------------
      Before considering implementing the BPF ring buffer from scratch,
      existing in-kernel alternatives were evaluated, but they didn't seem to
      meet the needs. They largely fell into a few categories:
        - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
          outlined above (ordering and memory consumption);
        - linked list-based implementations; while some were multi-producer
          designs, consuming these from user-space would be very complicated and
          most probably not performant; memory-mapping a contiguous piece of
          memory is simpler and more performant for user-space consumers;
        - io_uring is SPSC, but also requires fixed-size elements. Naively
          turning an SPSC queue into MPSC with a lock would have subpar
          performance compared to locked reserve + lockless commit, as with the
          BPF ring buffer. Fixed-size elements would be too limiting for BPF
          programs, given that existing BPF programs already rely heavily on
          the variable-sized perf buffer;
        - specialized implementations (like a new printk ring buffer, [0]) with lots
          of printk-specific limitations and implications, that didn't seem to fit
          well for intended use with BPF programs.
      
        [0] https://lwn.net/Articles/779550/

      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      457f4436
  22. 15 5月, 2020 1 次提交
    • A
      bpf: Implement CAP_BPF · 2c78ee89
      Alexei Starovoitov 提交于
      Implement permissions as stated in uapi/linux/capability.h.
      In order to do that, the verifier allow_ptr_leaks flag is split
      into four flags, which are set as:
        env->allow_ptr_leaks = bpf_allow_ptr_leaks();
        env->bypass_spec_v1 = bpf_bypass_spec_v1();
        env->bypass_spec_v4 = bpf_bypass_spec_v4();
        env->bpf_capable = bpf_capable();
      
      The first three are currently equivalent to perfmon_capable(), since
      leaking kernel pointers and reading kernel memory via side-channel
      attacks is roughly equivalent to reading kernel memory with cap_perfmon.
      
      'bpf_capable' enables bounded loops, precision tracking, bpf-to-bpf calls
      and other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr
      conversions and subtraction of pointers. 'bypass_spec_v1' disables
      speculative analysis in the verifier and run-time mitigations in bpf
      array, and enables indirect variable access in bpf programs.
      'bypass_spec_v4' disables emission of sanitation code by the verifier.
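
      A rough sketch of how these checks fit together (the actual definitions
      live in include/linux/capability.h and the bpf headers; details may
      differ):

        /* Hedged sketch: CAP_SYS_ADMIN keeps implying the new capabilities. */
        static inline bool bpf_capable(void)
        {
                return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
        }

        static inline bool perfmon_capable(void)
        {
                return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
        }

        /* Leaking pointers / bypassing Spectre mitigations is tied to
         * cap_perfmon, as described above. */
        static inline bool bpf_allow_ptr_leaks(void)
        {
                return perfmon_capable();
        }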
      
      That means that a networking BPF program loaded with CAP_BPF +
      CAP_NET_ADMIN will have speculative checks done by the verifier and other
      Spectre mitigations applied. Such a networking BPF program will not be
      able to leak kernel pointers and will not be able to access arbitrary
      kernel memory.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
      2c78ee89
  23. 31 3月, 2020 1 次提交
    • J
      bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      John Fastabend 提交于
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even
      though the program should be verifiable. BPF code that hits this can work
      around it by changing int variables to 64-bit types, marking variables
      volatile, etc. But this is all very ugly, so it would be better to avoid
      these tricks.
      
      But the main reason to address this now is that do_refine_retval_range()
      was assuming return values could not be negative. Once we fix this, code
      that was previously working will no longer work. See the
      do_refine_retval_range() patch for details. We don't want to suddenly
      cause programs that used to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is that without the w* bounds we are forced to
      use the worst-case bound of [0, U32_MAX]. To resolve this type of case,
      where jmp32 creates divergent 32-bit bounds from 64-bit bounds, we add
      explicit 32-bit register bounds s32_{min|max}_value and
      u32_{min|max}_value. Then, from the branch_taken logic creating new
      bounds, we can track 32-bit bounds explicitly.
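
      Schematically, each register's state gains dedicated 32-bit bounds next
      to the existing 64-bit ones (a sketch of the shape described above, not
      the verbatim bpf_verifier.h layout):

        /* Sketch: 64-bit bounds plus the new 32-bit subregister bounds. */
        struct reg_bounds_sketch {
                s64 smin_value, smax_value;        /* signed 64-bit bounds   */
                u64 umin_value, umax_value;        /* unsigned 64-bit bounds */
                s32 s32_min_value, s32_max_value;  /* signed 32-bit bounds   */
                u32 u32_min_value, u32_max_value;  /* unsigned 32-bit bounds */
        };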
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally, there is a question of how and when to merge bounds. The cases
      are enumerated here:
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
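
      A minimal sketch of that 64-bit-to-32-bit step, under the assumption
      that bounds are copied only when the 64-bit range provably fits (not the
      verbatim kernel helper):

        /* Sketch: derive 32-bit subreg bounds from the 64-bit state. */
        static void combine_64_into_32(struct bpf_reg_state *reg)
        {
                if (reg->smin_value >= S32_MIN && reg->smax_value <= S32_MAX) {
                        reg->s32_min_value = (s32)reg->smin_value;
                        reg->s32_max_value = (s32)reg->smax_value;
                }
                if (reg->umax_value <= U32_MAX) {
                        reg->u32_min_value = (u32)reg->umin_value;
                        reg->u32_max_value = (u32)reg->umax_value;
                }
                /* the 32-bit known-bits view is just the lower 32 bits of
                 * var_off, i.e. the (>>(<<var_off)) described above */
        }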
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
      3f50f132
  24. 11 1月, 2020 1 次提交
    • A
      bpf: Introduce function-by-function verification · 51c39bb1
      Alexei Starovoitov 提交于
      New llvm, and old llvm with libbpf's help, produce BTF that distinguishes
      global and static functions. Unlike the arguments of a static function,
      the arguments of global functions cannot be removed or optimized away by
      llvm. The compiler has to use exactly the arguments specified in the
      function prototype. The argument type information allows the verifier to
      validate each global function independently. For now the only supported
      argument types are pointer to context and scalars. In the future,
      pointers to structures, sizes, and pointer to packet data can be
      supported as well. Consider the following example:
      
      static int f1(int ...)
      {
        ...
      }
      
      int f3(int b);
      
      int f2(int a)
      {
        f1(a) + f3(a);
      }
      
      int f3(int b)
      {
        ...
      }
      
      int main(...)
      {
        f1(...) + f2(...) + f3(...);
      }
      
      The verifier will start its safety checks from the first global function,
      f2(). It will recursively descend into f1() because it's static. Then it
      will check that the arguments match for the f3() invocation inside f2().
      It will not descend into f3(). It will finish f2(), which has to be
      successfully verified for all possible values of 'a'. Then it will
      proceed with f3(). That function also has to be safe for all possible
      values of 'b'. Then it will start subprog 0 (which is the main()
      function). It will recursively descend into f1() and will skip the full
      check of f2() and f3(), since they are global. The order of processing
      global functions doesn't affect safety, since all global functions must
      be proven safe based on their arguments only.
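
      In BPF C terms (libbpf conventions, shown purely as an illustration),
      the distinction boils down to whether a subprogram is declared static:

        /* Sketch: static subprogs are checked per call site; non-static
         * ("global") subprogs are verified once, from their argument types. */
        static __noinline int helper_static(struct __sk_buff *skb)
        {
                return skb->len;             /* re-verified in each caller */
        }

        __noinline int helper_global(struct __sk_buff *skb)
        {
                return skb ? skb->len : 0;   /* verified once, independently */
        }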
      
      Such function-by-function verification can drastically improve the speed
      of verification and reduce its complexity.
      
      Note that the stack limit of 512 still applies to the call chain
      regardless of whether functions were static or global. The nesting level
      of 8 also still applies. The same recursion prevention checks are in
      place as well.
      
      The type information and static/global kind are preserved after
      verification, hence in the above example the global functions f2() and
      f3() can later be replaced by equivalent functions with the same types
      that are loaded and verified separately, without affecting the safety of
      this main() program. Such replacement (re-linking) of global functions is
      the subject of future patches.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
      51c39bb1
  25. 25 11月, 2019 1 次提交
    • D
      bpf: Constant map key tracking for prog array pokes · d2e4c1e6
      Daniel Borkmann 提交于
      Add tracking of constant keys into tail call maps. The signature of
      bpf_tail_call_proto is that arg1 is ctx, arg2 is the map pointer and arg3
      is an index key. The direct-call approach for tail calls can be enabled
      if the verifier has asserted that, for all branches leading to the tail
      call helper invocation, the map pointer and index key were both constant
      and the same.
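
      For illustration, this is the call pattern the optimization targets: a
      tail call whose map and index are constants on every path (libbpf-style
      sketch; names are made up):

        struct {
                __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
                __uint(max_entries, 4);
                __uint(key_size, sizeof(__u32));
                __uint(value_size, sizeof(__u32));
        } jmp_table SEC(".maps");

        SEC("xdp")
        int dispatch(struct xdp_md *ctx)
        {
                /* constant map pointer + constant index 2 on all paths */
                bpf_tail_call(ctx, &jmp_table, 2);
                return XDP_PASS;   /* fallthrough if slot 2 is empty */
        }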
      
      Tracking of map pointers is something we already do from prior work via
      c93552c4 ("bpf: properly enforce index mask to prevent out-of-bounds
      speculation") and 09772d92 ("bpf: avoid retpoline for
      lookup/update/delete calls on maps").
      
      Given that the tail call map index key is not on the stack but directly
      in a register, we can add a similar tracking approach and later, in
      fixup_bpf_calls(), add a poke descriptor to the prog's poke_tab with the
      relevant information for the JITing phase.
      
      We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
      instruction in order to point into the prog's poke_tab, and keep
      insn->imm as 0 as an indicator that the current indirect tail call
      emission must be used. Note that publishing to the tracker must happen at
      the end of fixup_bpf_calls(), since adding elements to the poke_tab
      reallocates its memory, so we need to wait until it is in its final
      state.
      
      Future work can generalize this and add a similar approach to optimize
      plain array map lookups. The difference there is that we need to look
      into the key value that sits on the stack. For clarity, map_state in
      bpf_insn_aux_data has been renamed to map_ptr_state, so we get
      map_{ptr,key}_state as trackers.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
      d2e4c1e6
  26. 16 11月, 2019 1 次提交
  27. 17 10月, 2019 2 次提交
    • A
      bpf: Implement accurate raw_tp context access via BTF · 9e15db66
      Alexei Starovoitov 提交于
      libbpf analyzes the bpf C program, searches in-kernel BTF for the given
      type name and stores it into expected_attach_type.
      The kernel verifier expects this btf_id to point to something like:
      typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
      which represents the signature of the raw_tracepoint "kfree_skb".

      Then btf_ctx_access() matches the ctx+0 access in the bpf program with
      the 'skb' argument and the ctx+8 access with the 'loc' argument of the
      "kfree_skb" tracepoint. In the first case it passes the btf_id of
      'struct sk_buff *' back to the verifier core, and 'void *' in the second
      case.
      
      The verifier then tracks PTR_TO_BTF_ID like any other pointer type.
      Just as PTR_TO_SOCKET points to 'struct bpf_sock',
      PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on,
      PTR_TO_BTF_ID points to in-kernel structs.
      If 1234 is the btf_id of 'struct sk_buff' in vmlinux's BTF,
      then PTR_TO_BTF_ID#1234 points to one of the in-kernel skbs.
      
      When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)(r1 + 32)),
      btf_struct_access() checks which field of 'struct sk_buff' is at offset
      32, checks that the size of the access matches the type definition of
      the field, and continues to track the dereferenced type.
      If that field was a pointer to 'struct net_device', then r2's type will
      be PTR_TO_BTF_ID#456, where 456 is the btf_id of 'struct net_device'
      in vmlinux's BTF.
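
      In source form, this enables direct struct access such as the following
      (a sketch using later libbpf conventions like SEC("tp_btf") and
      BPF_PROG, which are not part of this commit):

        /* Sketch: the verifier follows skb (PTR_TO_BTF_ID of sk_buff) into
         * skb->dev (PTR_TO_BTF_ID of net_device) purely from BTF. */
        SEC("tp_btf/kfree_skb")
        int BPF_PROG(trace_kfree_skb, struct sk_buff *skb, void *location)
        {
                int ifindex = 0;

                if (skb->dev)
                        ifindex = skb->dev->ifindex;
                bpf_printk("kfree_skb on ifindex %d", ifindex);
                return 0;
        }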
      
      Such verifier analysis prevents "cheating" in a BPF C program.
      The program cannot cast an arbitrary pointer to 'struct sk_buff *'
      and access it. The C compiler would allow the type cast, of course,
      but the verifier will notice the type mismatch based on the BPF assembly
      and the in-kernel BTF.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
      9e15db66
    • A
      bpf: Process in-kernel BTF · 8580ac94
      Alexei Starovoitov 提交于
      If in-kernel BTF exists, parse it and prepare 'struct btf *btf_vmlinux'
      for further use by the verifier.
      In-kernel BTF is trusted just like kallsyms and other build artifacts
      embedded into vmlinux.
      Yet run this BTF image through the BTF verifier to make sure
      that it is valid and wasn't mangled during the build.
      If the in-kernel BTF is incorrect, it means either gcc, pahole or the
      kernel is buggy. In such a case, disallow loading BPF programs.
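
      The flow can be sketched as a single init-time step (assuming a helper
      along the lines of btf_parse_vmlinux(); not the verbatim kernel code):

        /* Sketch: validate the embedded BTF image once; if it fails the BTF
         * verifier, refuse to load BPF programs at all. */
        struct btf *btf_vmlinux = btf_parse_vmlinux();

        if (IS_ERR(btf_vmlinux)) {
                pr_err("in-kernel BTF is malformed, disallowing BPF loads\n");
                return PTR_ERR(btf_vmlinux);
        }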
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
      8580ac94
  28. 28 8月, 2019 1 次提交
  29. 19 6月, 2019 2 次提交
    • A
      bpf: precise scalar_value tracking · b5dc0163
      Alexei Starovoitov 提交于
      Introduce precision tracking logic that
      helps cilium programs the most:
                        old clang  old clang    new clang  new clang
                                with all patches         with all patches
      bpf_lb-DLB_L3.o      1838     2283         1923       1863
      bpf_lb-DLB_L4.o      3218     2657         3077       2468
      bpf_lb-DUNKNOWN.o    1064     545          1062       544
      bpf_lxc-DDROP_ALL.o  26935    23045        166729     22629
      bpf_lxc-DUNKNOWN.o   34439    35240        174607     28805
      bpf_netdev.o         9721     8753         8407       6801
      bpf_overlay.o        6184     7901         5420       4754
      bpf_lxc_jit.o        39389    50925        39389      50925
      
      Consider code:
      654: (85) call bpf_get_hash_recalc#34
      655: (bf) r7 = r0
      656: (15) if r8 == 0x0 goto pc+29
      657: (bf) r2 = r10
      658: (07) r2 += -48
      659: (18) r1 = 0xffff8881e41e1b00
      661: (85) call bpf_map_lookup_elem#1
      662: (15) if r0 == 0x0 goto pc+23
      663: (69) r1 = *(u16 *)(r0 +0)
      664: (15) if r1 == 0x0 goto pc+21
      665: (bf) r8 = r7
      666: (57) r8 &= 65535
      667: (bf) r2 = r8
      668: (3f) r2 /= r1
      669: (2f) r2 *= r1
      670: (bf) r1 = r8
      671: (1f) r1 -= r2
      672: (57) r1 &= 255
      673: (25) if r1 > 0x1e goto pc+12
       R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
      674: (67) r1 <<= 1
      675: (0f) r0 += r1
      
      At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
      R1 has to be precise for later operations on R0 to be validated properly.
      
      The verifier will backtrack the above code in the following way:
      last_idx 675 first_idx 664
      regs=2 stack=0 before 675: (0f) r0 += r1         // started backtracking R1 regs=2 is a bitmask
      regs=2 stack=0 before 674: (67) r1 <<= 1
      regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
      regs=2 stack=0 before 672: (57) r1 &= 255
      regs=2 stack=0 before 671: (1f) r1 -= r2         // now both R1 and R2 has to be precise -> regs=6 mask
      regs=6 stack=0 before 670: (bf) r1 = r8          // after this insn R8 and R2 has to be precise
      regs=104 stack=0 before 669: (2f) r2 *= r1       // after this one R8, R2, and R1
      regs=106 stack=0 before 668: (3f) r2 /= r1
      regs=106 stack=0 before 667: (bf) r2 = r8
      regs=102 stack=0 before 666: (57) r8 &= 65535
      regs=102 stack=0 before 665: (bf) r8 = r7
      regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
       // this is the end of verifier state. The following regs will be marked precised:
       R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
      parent didn't have regs=82 stack=0 marks         // so backtracking continues into parent state
      last_idx 663 first_idx 655
      regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0)   // R1 was assigned no need to track it further
      regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23    // keep tracking R7
      regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1  // keep tracking R7
      regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
      regs=80 stack=0 before 658: (07) r2 += -48
      regs=80 stack=0 before 657: (bf) r2 = r10
      regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
      regs=80 stack=0 before 655: (bf) r7 = r0                // here the assignment into R7
       // mark R0 to be precise:
       R0_rw=invP(id=0)
      parent didn't have regs=1 stack=0 marks                 // regs=1 -> tracking R0
      last_idx 654 first_idx 644
      regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
        // nothing further to backtrack
      
      Two scalar registers not marked precise are equivalent from the
      state-pruning point of view. More details are in the patch comments.

      It doesn't support bpf2bpf calls yet and is enabled for root only.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b5dc0163
    • A
      bpf: introduce bounded loops · 2589726d
      Alexei Starovoitov 提交于
      Allow the verifier to validate loops by simulating their execution.
      Existing programs have used '#pragma unroll' to have the compiler unroll
      the loops. Instead, let the verifier simulate all iterations of the loop.
      In order to do that, introduce a parentage chain of bpf_verifier_state
      and a 'branches' counter for the number of branches left to explore.
      See the more detailed algorithm description in bpf_verifier.h.

      This algorithm borrows the key idea from Edward Cree's approach:
      https://patchwork.ozlabs.org/patch/877222/
      Additional state pruning heuristics make such a brute-force loop walk
      practical even for large loops.
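
      For illustration, a loop like the following can now be verified as-is,
      with no '#pragma unroll' (a sketch; the bound of 64 is arbitrary):

        /* Sketch: the verifier simulates all 64 iterations and proves the
         * loop terminates, so compiler unrolling is no longer required. */
        __u32 hash = 0;
        int i;

        for (i = 0; i < 64; i++)
                hash = (hash << 5) + hash + i;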
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2589726d
  30. 31 5月, 2019 1 次提交