1. 23 4月, 2019 1 次提交
  2. 10 4月, 2019 1 次提交
    • D
      bpf: implement lookup-free direct value access for maps · d8eca5bb
      Daniel Borkmann 提交于
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for ldimm64 instruction that indicates
      that inside the first part of the double insns's imm field is a
      file descriptor which the verifier then replaces as a full 64bit
      address of the map into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is array map with single entry.
      It is possible to support more than just single map element by
      reusing both 16bit off fields of the insns as a map index, so
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of concrete use case, but
      could easily be done so in future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
      map pointer versus load of map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
      regular map pointer and map value pointer would add unnecessary
      complexity and increases barrier for debugability thus less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since maximum
      possible value size is in u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8eca5bb
  3. 04 4月, 2019 2 次提交
    • A
      bpf: improve verification speed by droping states · 9f4686c4
      Alexei Starovoitov 提交于
      Branch instructions, branch targets and calls in a bpf program are
      the places where the verifier remembers states that led to successful
      verification of the program.
      These states are used to prune brute force program analysis.
      For unprivileged programs there is a limit of 64 states per such
      'branching' instructions (maximum length is tracked by max_states_per_insn
      counter introduced in the previous patch).
      Simply reducing this threshold to 32 or lower increases insn_processed
      metric to the point that small valid programs get rejected.
      For root programs there is no limit and cilium programs can have
      max_states_per_insn to be 100 or higher.
      Walking 100+ states multiplied by number of 'branching' insns during
      verification consumes significant amount of cpu time.
      Turned out simple LRU-like mechanism can be used to remove states
      that unlikely will be helpful in future search pruning.
      This patch introduces hit_cnt and miss_cnt counters:
      hit_cnt - this many times this state successfully pruned the search
      miss_cnt - this many times this state was not equivalent to other states
      (and that other states were added to state list)
      
      The heuristic introduced in this patch is:
      if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
        /* drop this state from future considerations */
      
      Higher numbers increase max_states_per_insn (allow more states to be
      considered for pruning) and slow verification speed, but do not meaningfully
      reduce insn_processed metric.
      Lower numbers drop too many states and insn_processed increases too much.
      Many different formulas were considered.
      This one is simple and works well enough in practice.
      (the analysis was done on selftests/progs/* and on cilium programs)
      
      The end result is this heuristic improves verification speed by 10 times.
      Large synthetic programs that used to take a second more now take
      1/10 of a second.
      In cases where max_states_per_insn used to be 100 or more, now it's ~10.
      
      There is a slight increase in insn_processed for cilium progs:
                             before   after
      bpf_lb-DLB_L3.o 	1831	1838
      bpf_lb-DLB_L4.o 	3029	3218
      bpf_lb-DUNKNOWN.o 	1064	1064
      bpf_lxc-DDROP_ALL.o	26309	26935
      bpf_lxc-DUNKNOWN.o	33517	34439
      bpf_netdev.o		9713	9721
      bpf_overlay.o		6184	6184
      bpf_lcx_jit.o		37335	39389
      And 2-3 times improvement in the verification speed.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9f4686c4
    • A
      bpf: add verifier stats and log_level bit 2 · 06ee7115
      Alexei Starovoitov 提交于
      In order to understand the verifier bottlenecks add various stats
      and extend log_level:
      log_level 1 and 2 are kept as-is:
      bit 0 - level=1 - print every insn and verifier state at branch points
      bit 1 - level=2 - print every insn and verifier state at every insn
      bit 2 - level=4 - print verifier error and stats at the end of verification
      
      When verifier rejects the program the libbpf is trying to load the program twice.
      Once with log_level=0 (no messages, only error code is reported to user space)
      and second time with log_level=1 to tell the user why the verifier rejected it.
      
      With introduction of bit 2 - level=4 the libbpf can choose to always use that
      level and load programs once, since the verification speed is not affected and
      in case of error the verbose message will be available.
      
      Note that the verifier stats are not part of uapi just like all other
      verbose messages. They're expected to change in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      06ee7115
  4. 14 3月, 2019 1 次提交
    • M
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Martin KaFai Lau 提交于
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is the verifier did not scrub the register's states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which has a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to avoid these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcount-ed arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if >1 refcount arg
        found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by a acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: NLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1b986589
  5. 02 2月, 2019 1 次提交
    • A
      bpf: introduce bpf_spin_lock · d83525ca
      Alexei Starovoitov 提交于
      Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
      bpf program serialize access to other variables.
      
      Example:
      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock;
      };
      struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
      if (val) {
          bpf_spin_lock(&val->lock);
          val->cnt++;
          bpf_spin_unlock(&val->lock);
      }
      
      Restrictions and safety checks:
      - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
      - BTF description of the map is mandatory for safety analysis.
      - bpf program can take one bpf_spin_lock at a time, since two or more can
        cause dead locks.
      - only one 'struct bpf_spin_lock' is allowed per map element.
        It drastically simplifies implementation yet allows bpf program to use
        any number of bpf_spin_locks.
      - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
      - bpf program must bpf_spin_unlock() before return.
      - bpf program can access 'struct bpf_spin_lock' only via
        bpf_spin_lock()/bpf_spin_unlock() helpers.
      - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
      - to use bpf_spin_lock() helper the BTF description of map value must be
        a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
        Nested lock inside another struct is not allowed.
      - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
      - syscall map_update and program map_update do not update bpf_spin_lock field.
      - bpf_spin_lock cannot be on the stack or inside networking packet.
        bpf_spin_lock can only be inside HASH or ARRAY map value.
      - bpf_spin_lock is available to root only and to all program types.
      - bpf_spin_lock is not allowed in inner maps of map-in-map.
      - ld_abs is not allowed inside spin_lock-ed region.
      - tracing progs and socket filter progs cannot use bpf_spin_lock due to
        insufficient preemption checks
      
      Implementation details:
      - cgroup-bpf class of programs can nest with xdp/tc programs.
        Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
        Other solutions to avoid nested bpf_spin_lock are possible.
        Like making sure that all networking progs run with softirq disabled.
        spin_lock_irqsave is the simplest and doesn't add overhead to the
        programs that don't use it.
      - arch_spinlock_t is used when its implemented as queued_spin_lock
      - archs can force their own arch_spinlock_t
      - on architectures where queued_spin_lock is not available and
        sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
      - presence of bpf_spin_lock inside map value could have been indicated via
        extra flag during map_create, but specifying it via BTF is cleaner.
        It provides introspection for map key/value and reduces user mistakes.
      
      Next steps:
      - allow bpf_spin_lock in other map types (like cgroup local storage)
      - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
        to request kernel to grab bpf_spin_lock before rewriting the value.
        That will serialize access to map elements.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d83525ca
  6. 24 1月, 2019 2 次提交
  7. 06 1月, 2019 1 次提交
  8. 03 1月, 2019 2 次提交
    • D
      bpf: prevent out of bounds speculation on pointer arithmetic · 979d63d5
      Daniel Borkmann 提交于
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      979d63d5
    • D
      bpf: move {prev_,}insn_idx into verifier env · c08435ec
      Daniel Borkmann 提交于
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c08435ec
  9. 15 12月, 2018 2 次提交
    • A
      bpf: add self-check logic to liveness analysis · 9242b5f5
      Alexei Starovoitov 提交于
      Introduce REG_LIVE_DONE to check the liveness propagation
      and prepare the states for merging.
      See algorithm description in clean_live_states().
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9242b5f5
    • M
      bpf: verbose log bpf_line_info in verifier · d9762e84
      Martin KaFai Lau 提交于
      This patch adds bpf_line_info during the verifier's verbose.
      It can give error context for debug purpose.
      
      ~~~~~~~~~~
      Here is the verbose log for backedge:
      	while (a) {
      		a += bpf_get_smp_processor_id();
      		bpf_trace_printk(fmt, sizeof(fmt), a);
      	}
      
      ~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint
      13: while (a) {
      3: a += bpf_get_smp_processor_id();
      back-edge from insn 13 to 3
      
      ~~~~~~~~~~
      Here is the verbose log for invalid pkt access:
      Modification to test_xdp_noinline.c:
      
      	data = (void *)(long)xdp->data;
      	data_end = (void *)(long)xdp->data_end;
      /*
      	if (data + 4 > data_end)
      		return XDP_DROP;
      */
      	*(u32 *)data = dst->dst;
      
      ~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
      ; data = (void *)(long)xdp->data;
      224: (79) r2 = *(u64 *)(r10 -112)
      225: (61) r2 = *(u32 *)(r2 +0)
      ; *(u32 *)data = dst->dst;
      226: (63) *(u32 *)(r2 +0) = r1
      invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
      R2 offset is outside of the packet
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d9762e84
  10. 10 12月, 2018 1 次提交
    • M
      bpf: Add bpf_line_info support · c454a46b
      Martin KaFai Lau 提交于
      This patch adds bpf_line_info support.
      
      It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
      The "line_info", "line_info_cnt" and "line_info_rec_size" are added
      to the "union bpf_attr".  The "line_info_rec_size" makes
      bpf_line_info extensible in the future.
      
      The new "check_btf_line()" ensures the userspace line_info is valid
      for the kernel to use.
      
      When the verifier is translating/patching the bpf_prog (through
      "bpf_patch_insn_single()"), the line_infos' insn_off is also
      adjusted by the newly added "bpf_adj_linfo()".
      
      If the bpf_prog is jited, this patch also provides the jited addrs (in
      aux->jited_linfo) for the corresponding line_info.insn_off.
      "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
      It is currently called by the x86 jit.  Other jits can also use
      "bpf_prog_fill_jited_linfo()" and it will be done in the followup patches.
      In the future, if it deemed necessary, a particular jit could also provide
      its own "bpf_prog_fill_jited_linfo()" implementation.
      
      A few "*line_info*" fields are added to the bpf_prog_info such
      that the user can get the xlated line_info back (i.e. the line_info
      with its insn_off reflecting the translated prog).  The jited_line_info
      is available if the prog is jited.  It is an array of __u64.
      If the prog is not jited, jited_line_info_cnt is 0.
      
      The verifier's verbose log with line_info will be done in
      a follow up patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c454a46b
  11. 27 11月, 2018 1 次提交
    • Y
      bpf: btf: support proper non-jit func info · ba64e7d8
      Yonghong Song 提交于
      Commit 838e9690 ("bpf: Introduce bpf_func_info")
      added bpf func info support. The userspace is able
      to get better ksym's for bpf programs with jit, and
      is able to print out func prototypes.
      
      For a program containing func-to-func calls, the existing
      implementation returns user specified number of function
      calls and BTF types if jit is enabled. If the jit is not
      enabled, it only returns the type for the main function.
      
      This is undesirable. Interpreter may still be used
      and we should keep feature identical regardless of
      whether jit is enabled or not.
      This patch fixed this discrepancy.
      
      Fixes: 838e9690 ("bpf: Introduce bpf_func_info")
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      ba64e7d8
  12. 21 11月, 2018 1 次提交
    • Y
      bpf: Introduce bpf_func_info · 838e9690
      Yonghong Song 提交于
      This patch added interface to load a program with the following
      additional information:
         . prog_btf_fd
         . func_info, func_info_rec_size and func_info_cnt
      where func_info will provide function range and type_id
      corresponding to each function.
      
      The func_info_rec_size is introduced in the UAPI to specify
      struct bpf_func_info size passed from user space. This
      intends to make bpf_func_info structure growable in the future.
      If the kernel gets a different bpf_func_info size from userspace,
      it will try to handle user request with part of bpf_func_info
      it can understand. In this patch, kernel can understand
        struct bpf_func_info {
             __u32   insn_offset;
             __u32   type_id;
        };
      If user passed a bpf func_info record size of 16 bytes, the
      kernel can still handle part of records with the above definition.
      
      If verifier agrees with function range provided by the user,
      the bpf_prog ksym for each function will use the func name
      provided in the type_id, which is supposed to provide better
      encoding as it is not limited by 16 bytes program name
      limitation and this is better for bpf program which contains
      multiple subprograms.
      
      The bpf_prog_info interface is also extended to
      return btf_id, func_info, func_info_rec_size and func_info_cnt
      to userspace, so userspace can print out the function prototype
      for each xlated function. The insn_offset in the returned
      func_info corresponds to the insn offset for xlated functions.
      With other jit related fields in bpf_prog_info, userspace can also
      print out function prototypes for each jited function.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      838e9690
  13. 11 11月, 2018 1 次提交
  14. 01 11月, 2018 1 次提交
  15. 08 10月, 2018 1 次提交
  16. 03 10月, 2018 3 次提交
  17. 30 8月, 2018 1 次提交
    • E
      bpf/verifier: per-register parent pointers · 679c782d
      Edward Cree 提交于
      By giving each register its own liveness chain, we elide the skip_callee()
       logic.  Instead, each register's parent is the state it inherits from;
       both check_func_call() and prepare_func_exit() automatically connect
       reg states to the correct chain since when they copy the reg state across
       (r1-r5 into the callee as args, and r0 out as the return value) they also
       copy the parent pointer.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      679c782d
  18. 24 5月, 2018 1 次提交
    • D
      bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Daniel Borkmann 提交于
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries where
      we would allow to a certain degree out of bounds speculation for
      the smaller map, and in the 2nd variant where the mask is applied
      from the map with the smaller number of entries, we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c93552c4
  19. 20 5月, 2018 1 次提交
    • A
      bpf: Prevent memory disambiguation attack · af86ca4e
      Alexei Starovoitov 提交于
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      af86ca4e
  20. 17 5月, 2018 1 次提交
  21. 04 5月, 2018 2 次提交
  22. 26 3月, 2018 2 次提交
  23. 10 1月, 2018 1 次提交
    • Q
      bpf: export function to write into verifier log buffer · 430e68d1
      Quentin Monnet 提交于
      Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
      export it, so that other components (in particular, drivers for BPF
      offload) can reuse the user buffer log to dump error messages at
      verification time.
      
      Renaming `verbose()` was necessary in order to avoid a name so generic
      to be exported to the global namespace. However to prevent too much pain
      for backports, the calls to `verbose()` in the kernel BPF verifier were
      not changed. Instead, use function aliasing to make `verbose` point to
      `bpf_verifier_log_write`. Another solution could consist in making a
      wrapper around `verbose()`, but since it is a variadic function, I don't
      see a clean way without creating two identical wrappers, one for the
      verifier and one to export.
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      430e68d1
  24. 31 12月, 2017 1 次提交
  25. 28 12月, 2017 1 次提交
  26. 21 12月, 2017 1 次提交
    • A
      bpf: fix integer overflows · bb7f0f98
      Alexei Starovoitov 提交于
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      bb7f0f98
  27. 18 12月, 2017 4 次提交
    • A
      bpf: x64: add JIT support for multi-function programs · 1c2a088a
      Alexei Starovoitov 提交于
      Typical JIT does several passes over bpf instructions to
      compute total size and relative offsets of jumps and calls.
      With multitple bpf functions calling each other all relative calls
      will have invalid offsets intially therefore we need to additional
      last pass over the program to emit calls with correct offsets.
      For example in case of three bpf functions:
      main:
        call foo
        call bpf_map_lookup
        exit
      foo:
        call bar
        exit
      bar:
        exit
      
      We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
      x64 JIT typically does 4-5 passes to converge.
      After these initial passes the image for these 3 functions
      will be good except call targets, since start addresses of
      foo() and bar() are unknown when we were JITing main()
      (note that call bpf_map_lookup will be resolved properly
      during initial passes).
      Once start addresses of 3 functions are known we patch
      call_insn->imm to point to right functions and call
      bpf_int_jit_compile() again which needs only one pass.
      Additional safety checks are done to make sure this
      last pass doesn't produce image that is larger or smaller
      than previous pass.
      
      When constant blinding is on it's applied to all functions
      at the first pass, since doing it once again at the last
      pass can change size of the JITed code.
      
      Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
      x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
      All other JITs that support normal BPF_CALL will behave the same way
      since bpf-to-bpf call is equivalent to bpf-to-kernel call from
      JITs point of view.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1c2a088a
    • A
      bpf: teach verifier to recognize zero initialized stack · cc2b14d5
      Alexei Starovoitov 提交于
      programs with function calls are often passing various
      pointers via stack. When all calls are inlined llvm
      flattens stack accesses and optimizes away extra branches.
      When functions are not inlined it becomes the job of
      the verifier to recognize zero initialized stack to avoid
      exploring paths that program will not take.
      The following program would fail otherwise:
      
      ptr = &buffer_on_stack;
      *ptr = 0;
      ...
      func_call(.., ptr, ...) {
        if (..)
          *ptr = bpf_map_lookup();
      }
      ...
      if (*ptr != 0) {
        // Access (*ptr)->field is valid.
        // Without stack_zero tracking such (*ptr)->field access
        // will be rejected
      }
      
      since stack slots are no longer uniform invalid | spill | misc
      add liveness marking to all slots, but do it in 8 byte chunks.
      So if nothing was read or written in [fp-16, fp-9] range
      it will be marked as LIVE_NONE.
      If any byte in that range was read, it will be marked LIVE_READ
      and stacksafe() check will perform byte-by-byte verification.
      If all bytes in the range were written the slot will be
      marked as LIVE_WRITTEN.
      This significantly speeds up state equality comparison
      and reduces total number of states processed.
      
                          before   after
      bpf_lb-DLB_L3.o       2051    2003
      bpf_lb-DLB_L4.o       3287    3164
      bpf_lb-DUNKNOWN.o     1080    1080
      bpf_lxc-DDROP_ALL.o   24980   12361
      bpf_lxc-DUNKNOWN.o    34308   16605
      bpf_netdev.o          15404   10962
      bpf_overlay.o         7191    6679
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc2b14d5
    • A
      bpf: introduce function calls (verification) · f4d7e40a
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      To recognize such set of bpf functions the verifier does:
      1. runs control flow analysis to detect function boundaries
      2. proceeds with verification of all functions starting from main(root) function
      It recognizes that the stack of the caller can be accessed by the callee
      (if the caller passed a pointer to its stack to the callee) and the callee
      can store map_value and other pointers into the stack of the caller.
      3. keeps track of the stack_depth of each function to make sure that total
      stack depth is still less than 512 bytes
      4. disallows pointers to the callee stack to be stored into the caller stack,
      since they will be invalid as soon as the callee returns
      5. to reuse all of the existing state_pruning logic each function call
      is considered to be independent call from the verifier point of view.
      The verifier pretends to inline all function calls it sees are being called.
      It stores the callsite instruction index as part of the state to make sure
      that two calls to the same callee from two different places in the caller
      will be different from state pruning point of view
      6. more safety checks are added to liveness analysis
      
      Implementation details:
      . struct bpf_verifier_state is now consists of all stack frames that
        led to this function
      . struct bpf_func_state represent one stack frame. It consists of
        registers in the given frame and its stack
      . propagate_liveness() logic had a premature optimization where
        mark_reg_read() and mark_stack_slot_read() were manually inlined
        with loop iterating over parents for each register or stack slot.
        Undo this optimization to reuse more complex mark_*_read() logic
      . skip_callee() logic is not necessary from safety point of view,
        but without it mark_*_read() markings become too conservative,
        since after returning from the funciton call a read of r6-r9
        will incorrectly propagate the read marks into callee causing
        inefficient pruning later
      . mark_*_read() logic is now aware of control flow which makes it
        more complex. In the future the plan is to rewrite liveness
        to be hierarchical. So that liveness can be done within
        basic block only and control flow will be responsible for
        propagation of liveness information along cfg and between calls.
      . tail_calls and ld_abs insns are not allowed in the programs with
        bpf-to-bpf calls
      . returning stack pointers to the caller or storing them into stack
        frame of the caller is not allowed
      
      Testing:
      . no difference in cilium processed_insn numbers
      . large number of tests follows in next patches
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f4d7e40a
    • A
      bpf: introduce function calls (function boundaries) · cc8b0b92
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      Since the beginning of bpf all bpf programs were represented as a single function
      and program authors were forced to use always_inline for all functions
      in their C code. That was causing llvm to unnecessary inflate the code size
      and forcing developers to move code to header files with little code reuse.
      
      With a bit of additional complexity teach verifier to recognize
      arbitrary function calls from one bpf function to another as long as
      all of functions are presented to the verifier as a single bpf program.
      New program layout:
      r6 = r1    // some code
      ..
      r1 = ..    // arg1
      r2 = ..    // arg2
      call pc+1  // function call pc-relative
      exit
      .. = r1    // access arg1
      .. = r2    // access arg2
      ..
      call pc+20 // second level of function call
      ...
      
      It allows for better optimized code and finally allows to introduce
      the core bpf libraries that can be reused in different projects,
      since programs are no longer limited by single elf file.
      With function calls bpf can be compiled into multiple .o files.
      
      This patch is the first step. It detects programs that contain
      multiple functions and checks that calls between them are valid.
      It splits the sequence of bpf instructions (one program) into a set
      of bpf functions that call each other. Calls to only known
      functions are allowed. In the future the verifier may allow
      calls to unresolved functions and will do dynamic linking.
      This logic supports statically linked bpf functions only.
      
      Such function boundary detection could have been done as part of
      control flow graph building in check_cfg(), but it's cleaner to
      separate function boundary detection vs control flow checks within
      a subprogram (function) into logically indepedent steps.
      Follow up patches may split check_cfg() further, but not check_subprogs().
      
      Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
      These restrictions can be relaxed in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc8b0b92
  28. 23 11月, 2017 1 次提交
    • A
      bpf: fix branch pruning logic · c131187d
      Alexei Starovoitov 提交于
      when the verifier detects that register contains a runtime constant
      and it's compared with another constant it will prune exploration
      of the branch that is guaranteed not to be taken at runtime.
      This is all correct, but malicious program may be constructed
      in such a way that it always has a constant comparison and
      the other branch is never taken under any conditions.
      In this case such path through the program will not be explored
      by the verifier. It won't be taken at run-time either, but since
      all instructions are JITed the malicious program may cause JITs
      to complain about using reserved fields, etc.
      To fix the issue we have to track the instructions explored by
      the verifier and sanitize instructions that are dead at run time
      with NOPs. We cannot reject such dead code, since llvm generates
      it for valid C code, since it doesn't do as much data flow
      analysis as the verifier does.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c131187d
  29. 21 11月, 2017 1 次提交