1. 03 Jun, 2018 2 commits
    • D
      bpf: avoid retpoline for lookup/update/delete calls on maps · 09772d92
      Daniel Borkmann committed
      While some of the BPF map lookup helpers provide a ->map_gen_lookup()
      callback for inlining the map lookup altogether it is not available
      for every map, so the remaining ones have to call bpf_map_lookup_elem()
      helper which does a dispatch to map->ops->map_lookup_elem(). In
      times of retpolines, this will control and trap speculative execution
      rather than letting it do its work for the indirect call and will
      therefore cause a slowdown. Likewise, bpf_map_update_elem() and
      bpf_map_delete_elem() do not have an inlined version and need to call
      into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
      handlers.
      
      Before:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#232656
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
         16: (95) exit                                 helper
      
      After:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#233328
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
         16: (95) exit
      
      In all three lookup/update/delete cases however we can use the actual
      address of the map callback directly if we find that there's only a
      single path with a map pointer leading to the helper call, meaning
      when the map pointer has not been poisoned from verifier side.
      Example code can be seen above for the delete case.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09772d92
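The difference between the dispatching helper and a resolved direct call can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct map_ops`, `struct map`, `lru_delete()`, `dispatch_delete()` and `resolve_delete()` are hypothetical names, not the kernel's definitions, and the real verifier patches the call instruction rather than returning a function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's map and ops structures. */
struct map;
typedef int (*delete_fn)(struct map *map, const void *key);

struct map_ops {
    delete_fn map_delete_elem;
};

struct map {
    const struct map_ops *ops;
    int deleted; /* counts deletions, for demonstration only */
};

/* A concrete per-map-type callback, comparable in role to
 * htab_lru_map_delete_elem() in the dump above. */
static int lru_delete(struct map *map, const void *key)
{
    (void)key;
    map->deleted++;
    return 0;
}

static const struct map_ops lru_ops = { .map_delete_elem = lru_delete };

/* Generic helper: every invocation goes through an indirect call,
 * which is what a retpoline slows down. */
static int dispatch_delete(struct map *map, const void *key)
{
    assert(map && map->ops && map->ops->map_delete_elem);
    return map->ops->map_delete_elem(map, key);
}

/* Load-time resolution: when only one map can reach the call site
 * (poisoned == 0), hand back the concrete callback so the call can
 * target it directly; otherwise keep the generic dispatcher. */
static delete_fn resolve_delete(const struct map *map, int poisoned)
{
    if (!poisoned && map && map->ops && map->ops->map_delete_elem)
        return map->ops->map_delete_elem;
    return dispatch_delete;
}
```

Calling `resolve_delete(&m, 0)` once and reusing the returned pointer mirrors the "After" dump, where the call targets the map-specific function; passing a non-zero `poisoned` keeps the behaviour of the "Before" dump.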
    • D
      bpf: fixup error message from gpl helpers on license mismatch · 3fe2867c
      Daniel Borkmann committed
      Stating 'proprietary program' in the error is just silly since the
      program can also be under a different open source license that is
      simply not GPL-compatible.
      
      Reference: https://twitter.com/majek04/status/998531268039102465
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3fe2867c
  2. 24 May, 2018 4 commits
    • D
      bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Daniel Borkmann committed
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries where
      we would allow to a certain degree out of bounds speculation for
      the smaller map, and in the 2nd variant where the mask is applied
      from the map with the smaller number of entries, we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c93552c4
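A small worked example shows why a mask taken from the wrong map is a problem. It assumes the mask is the smallest power of two covering `max_entries`, minus one, which matches the masks 255 (160 entries) and 3 (4 entries) in the dumps above. `index_mask_for()` and `masked_index()` are hypothetical helper names for this sketch, not verifier functions.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest (power of two) - 1 that covers indices 0..max_entries-1. */
static uint32_t index_mask_for(uint32_t max_entries)
{
    uint32_t pow2 = 1;

    assert(max_entries > 0 && max_entries <= (1u << 31));
    while (pow2 < max_entries)
        pow2 <<= 1;
    return pow2 - 1;
}

/* Bounds check followed by masking, in the spirit of the two
 * instructions emitted ahead of the tail call. Returns 0 and stores
 * the masked index, or -1 if the index is out of bounds. */
static int masked_index(uint32_t index, uint32_t max_entries,
                        uint32_t mask, uint32_t *out)
{
    if (index >= max_entries)
        return -1;       /* architecturally out of bounds */
    *out = index & mask; /* also clamps a mispredicted path */
    return 0;
}
```

With the mask of the 4-entry map applied to the 160-entry map, index 7 is silently turned into 3 (the truncation of variant 2). With the mask of the 160-entry map applied to the 4-entry map, a speculated index such as 100 survives the mask although only 4 slots exist (the leak of variant 1).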
    • M
      ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Mathieu Xhonneux committed
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an IPv6
      header with SRH is needed, whose DA has to be equal to the SID of the
      action. It also advances the SRH to the next segment, so the BPF program
      does not have to take care of this.
      
      Since the BPF program must not be a source of instability in the kernel, it
      is important to ensure that the integrity of the packet is maintained
      before yielding it back to the IPv6 layer. The hook hence keeps track of
      whether the SRH has been altered through the helpers, and re-validates its
      content if needed with seg6_validate_srh. The state kept for validation is
      stored in a per-CPU buffer. The BPF program is not allowed to directly
      write into the packet, and only some fields of the SRH can be altered
      through the helper bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not induce
      a significant overhead. If the altered SRH is deemed invalid, the packet
      is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes:
          - BPF_OK: the End.BPF action will look up the next destination through
                   seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return this
                value, as the skb's destination is already set and the default
                lookup should not be performed.
          - BPF_DROP : the packet will be dropped.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      004d4b27
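The way the hook reacts to the three return codes and to a modified SRH can be sketched as a small decision function. The enums and `end_bpf_finish()` below are hypothetical names for this illustration only; the sketch models the behaviour described in the commit message and does not reproduce the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three verdicts named in the commit message. */
enum verdict { V_OK, V_REDIRECT, V_DROP };

/* What the hook does with the packet afterwards. */
enum outcome { OUT_LOOKUP_NEXTHOP, OUT_KEEP_DST, OUT_DROPPED };

static enum outcome end_bpf_finish(enum verdict v, bool srh_modified,
                                   bool srh_still_valid)
{
    assert(v == V_OK || v == V_REDIRECT || v == V_DROP);

    if (v == V_DROP)
        return OUT_DROPPED;

    /* An SRH altered through the helpers must pass re-validation. */
    if (srh_modified && !srh_still_valid)
        return OUT_DROPPED;

    /* After a redirect the destination is already set, so the default
     * next-hop lookup is skipped. */
    return v == V_REDIRECT ? OUT_KEEP_DST : OUT_LOOKUP_NEXTHOP;
}
```

An unmodified SRH is never re-validated in this sketch, which corresponds to the statement that validation is only repeated when the SRH has been altered.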
    • S
      bpf: get kernel symbol addresses via syscall · dbecd738
      Sandipan Das committed
      This adds two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of kernel symbol addresses for all functions in a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      When bpf_jit_kallsyms is enabled, we can get the address
      of the corresponding kernel symbol for a callee function
      and resolve the symbol's name. The address is determined
      by adding the value of the call instruction's imm field
      to __bpf_call_base. This offset gets assigned to the imm
      field by the verifier.
      
      For some architectures, such as powerpc64, the imm field
      is not large enough to hold this offset.
      
      We resolve this by:
      
      [1] Assigning the subprog id to the imm field of a call
          instruction in the verifier instead of the offset of
          the callee's symbol's address from __bpf_call_base.
      
      [2] Determining the address of a callee's corresponding
          symbol by using the imm field as an index for the
          list of kernel symbol addresses now available from
          the program info.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      dbecd738
    • S
      bpf: support 64-bit offsets for bpf function calls · 2162fed4
      Sandipan Das committed
      The imm field of a bpf instruction is a signed 32-bit integer.
      For JITed bpf-to-bpf function calls, it holds the offset of the
      start address of the callee's JITed image from __bpf_call_base.
      
      For some architectures, such as powerpc64, this offset may be
      as large as 64 bits and cannot be accommodated in the imm field
      without truncation.
      
      We resolve this by:
      
      [1] Additionally using the auxiliary data of each function to
          keep a list of start addresses of the JITed images for all
          functions determined by the verifier.
      
      [2] Retaining the subprog id inside the off field of the call
          instructions and using it to index into the list mentioned
          above and look up the callee's address.
      
      To make sure that the existing JIT compilers continue to work
      without requiring changes, we keep the imm field as it is.
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2162fed4
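The truncation problem and the table-based lookup can be sketched as follows. `struct call_insn`, `offset_fits_imm()` and `resolve_callee()` are hypothetical names, and the addresses used with them are made up for illustration. Only the idea comes from the commit message: a signed 32-bit imm cannot hold every offset, so the subprog id kept in the off field indexes a list of start addresses instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two fields of a call instruction. */
struct call_insn {
    int16_t off; /* holds the subprog id in this scheme */
    int32_t imm; /* signed 32 bits, too narrow for some offsets */
};

/* Returns 1 if the distance from the call base fits into imm. */
static int offset_fits_imm(uint64_t callee, uint64_t call_base)
{
    int64_t delta = (int64_t)(callee - call_base);

    return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Resolve the callee by indexing the per-program address list with
 * the subprog id stored in the off field. */
static uint64_t resolve_callee(const struct call_insn *insn,
                               const uint64_t *func_addrs, size_t func_cnt)
{
    assert(insn && func_addrs);
    assert(insn->off >= 0 && (size_t)insn->off < func_cnt);
    return func_addrs[insn->off];
}
```

A JIT following this scheme would call something like `resolve_callee()` only when the offset does not fit; JITs that never hit that case can keep using imm unchanged, which is the compatibility point made in the last paragraph of the commit message.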
  3. 20 May, 2018 1 commit
    • A
      bpf: Prevent memory disambiguation attack · af86ca4e
      Alexei Starovoitov committed
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e
  4. 16 May, 2018 1 commit
  5. 10 May, 2018 1 commit
  6. 05 May, 2018 2 commits
  7. 04 May, 2018 5 commits
    • J
      bpf: add faked "ending" subprog · 4cb3d99c
      Jiong Wang committed
      There are quite a few code snippets like the following in the verifier:
      
             subprog_start = 0;
             if (env->subprog_cnt == cur_subprog + 1)
                     subprog_end = insn_cnt;
             else
                     subprog_end = env->subprog_info[cur_subprog + 1].start;
      
      The reason is that there is no marker in the subprog_info array to tell
      where it ends.
      
      We can resolve this by introducing a faked "ending" subprog. The special
      "ending" subprog has "insn_cnt" as its start offset, so it serves as the
      end mark whenever we iterate over all subprogs.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4cb3d99c
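The effect of the sentinel can be sketched with a cut-down environment structure. `struct env`, `MAX_SUBPROGS`, `add_ending_subprog()` and `subprog_len()` are hypothetical names chosen for this sketch; only the idea of a trailing entry whose start equals `insn_cnt` is taken from the commit message.

```c
#include <assert.h>

#define MAX_SUBPROGS 8 /* arbitrary limit for this sketch */

struct subprog_info {
    int start; /* index of the first instruction */
};

struct env {
    /* One extra slot for the faked "ending" subprog. */
    struct subprog_info subprog_info[MAX_SUBPROGS + 1];
    int subprog_cnt; /* number of real subprogs, sentinel excluded */
};

/* Append the sentinel: its start offset is the total instruction count. */
static void add_ending_subprog(struct env *env, int insn_cnt)
{
    assert(env->subprog_cnt >= 0 && env->subprog_cnt <= MAX_SUBPROGS);
    env->subprog_info[env->subprog_cnt].start = insn_cnt;
}

/* With the sentinel in place, the end of subprog idx is always the
 * start of entry idx + 1; no special case for the last subprog. */
static int subprog_len(const struct env *env, int idx)
{
    assert(idx >= 0 && idx < env->subprog_cnt);
    return env->subprog_info[idx + 1].start - env->subprog_info[idx].start;
}
```

Compared with the snippet quoted in the commit message, the `if (env->subprog_cnt == cur_subprog + 1)` branch disappears because the last real subprog now also has a successor entry.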
    • J
      bpf: centre subprog information fields · 9c8105bd
      Jiong Wang committed
      It is better to centre all subprog information fields into one structure.
      This structure could later serve as a function node in the call graph.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9c8105bd
    • J
      bpf: unify main prog and subprog · f910cefa
      Jiong Wang committed
      Currently, the verifier treats the main prog and subprogs differently. All
      subprogs detected are kept in env->subprog_starts while the main prog is not
      kept there. Instead, the main prog is implicitly defined as the prog
      starting at 0.
      
      There is actually no difference between the main prog and a subprog, so it
      is better to unify them and register all progs detected into
      env->subprog_starts.
      
      This could also help simplify some code logic.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f910cefa
    • D
      bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann committed
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
      keeping them around in native eBPF caused way more trouble than
      actually worth it. To just list some of the security issues in
      the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and more efficient/flexible
      alternatives that have been developed over time such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness and its exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF more recent program types will disable LD_ABS/LD_IND
      altogether through may_access_skb() in verifier, and given the
      limitations in terms of exception handling, it's also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, LD_ABS/LD_IND is used in networking programs
      to access packet data. It is not used in seccomp-BPF, but it is
      used by programs that do socket filtering or reuseport demuxing
      with cBPF. This is mostly relevant for applications that have not
      yet migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND comes
      from their implementation in the various JITs. Most of them keep
      the model around from cBPF times by implementing a fastpath written
      in asm. They typically use two CPU registers hidden from the BPF
      program for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires keeping track
      of whether LD_ABS/LD_IND are used and, if so, the two registers need
      to be recached each time a BPF helper would change the underlying
      packet data in the native eBPF case. At least in the eBPF case,
      available CPU registers are scarce, and the additional exit path out
      of the asm-written JIT helper also makes it inflexible since not all
      parts of the JITer are under control from plain C. An LD_ABS/LD_IND
      implementation in eBPF therefore allows significantly reducing
      the complexity in JITs with comparable performance results for
      them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This allows to shrink the
      image on x86_64 also for seccomp programs slightly since mapping
      to %rsi is not an ereg. In callee-saved R8 and R9 we now track
      skb data and headlen, respectively. For normal prologue emission
      in the JITs this does not add any extra instructions since R8, R9
      are pushed to stack in any case from eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter which probes the fast path inline
      already and falls back to bpf_skb_load_helper_{8,16,32}() helper
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data()
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      neither caches skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, it
      avoids any reload of skb data and headlen in the first place.
      Additionally, for the negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly to bpf_skb_load_bytes() and allows for more flexibility.
      Tested by myself on x64, arm64, s390x, and by Sandipan on ppc64.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e0cea7ce
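The fast path that the cBPF emitter probes inline can be sketched in userspace C for the 2-byte case. `load_abs_u16()` is a hypothetical name, and the sketch assumes the usual LD_ABS semantics of reading packet data in network byte order and returning it as a host integer. It models the bounds check against the cached headlen only; the kernel's fallback helper is represented by the error return.

```c
#include <assert.h>
#include <stdint.h>

/* Load a 16-bit big-endian value at `offset` from the linear part of a
 * packet. `data` and `headlen` stand for the two cached values (skb
 * data and headlen). Returns 0 on success, or -1 where the kernel
 * would fall back to the out-of-line helper. */
static int load_abs_u16(const uint8_t *data, int headlen, int offset,
                        uint16_t *out)
{
    assert(data && out && headlen >= 0);

    /* Negative offsets and reads past the linear area are not
     * handled by the fast path. */
    if (offset < 0 || headlen - offset < 2)
        return -1;

    *out = (uint16_t)((data[offset] << 8) | data[offset + 1]);
    return 0;
}
```

For a buffer whose bytes at offsets 2 and 3 are `0x00 0x16`, the function stores 22, the port used in the "tcpdump port 22" benchmark above.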
    • B
      bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel committed
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fbfc504a
  8. 29 Apr, 2018 4 commits
    • Y
      bpf/verifier: improve register value range tracking with ARSH · 9cbe1f5a
      Yonghong Song committed
      When helpers like bpf_get_stack return an int value that is
      later used for arithmetic computation, the LSH and ARSH
      operations are often required to get proper sign extension into
      64-bit. For example, without this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0)
      With this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
      With a better range for "R8", when "R8" is later added to another register,
      e.g., a map pointer or scalar-value register, a better register
      range can be derived and verifier failure may be avoided.
      
      In our later example,
          ......
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0)
              return 0;
          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
          ......
      Without improving ARSH value range tracking, the register representing
      "max_len - usize" will have smin_value equal to S64_MIN and will be
      rejected by verifier.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9cbe1f5a
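The `<<= 32` / `s>>= 32` pair and its effect on the upper bound can be sketched as follows. `sext32()`, `struct urange` and `arsh_bounds()` are hypothetical names, and the bound update shown is a deliberately simplified rule for this illustration, not the verifier's actual algorithm, which works on signed bounds and tnums.

```c
#include <assert.h>
#include <stdint.h>

/* What "r <<= 32; r s>>= 32" computes: the low 32 bits of r,
 * sign-extended to 64 bits. The cast to int32_t relies on the usual
 * two's complement wrap-around of common compilers. */
static int64_t sext32(uint64_t r)
{
    return (int64_t)(int32_t)(uint32_t)r;
}

/* Unsigned bounds of a register, a tiny subset of the tracked state. */
struct urange {
    uint64_t umin;
    uint64_t umax;
};

/* Simplified rule: if the sign bit cannot be set (top bit of umax is
 * clear), an arithmetic right shift behaves like a logical one, so
 * both bounds can be shifted. Returns -1 if nothing can be concluded. */
static int arsh_bounds(struct urange *r, unsigned int shift)
{
    assert(r && shift < 64);
    if (r->umax >> 63)
        return -1;
    r->umin >>= shift;
    r->umax >>= shift;
    return 0;
}
```

Starting from `umax_value=3435973836800` (that is, 800 shifted left by 32), the shift by 32 yields 800 again, the bound shown for R8 in the "With this patch" log.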
    • Y
      bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals · afbe1a5b
      Yonghong Song committed
      In verifier function adjust_scalar_min_max_vals,
      when src_known is false and the opcode is BPF_LSH/BPF_RSH,
      early return will happen in the function. So remove
      the branch in handling BPF_LSH/BPF_RSH when src_known is false.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      afbe1a5b
    • Y
      bpf/verifier: refine retval R0 state for bpf_get_stack helper · 849fa506
      Yonghong Song committed
      The special property of the return values of the helpers bpf_get_stack
      and bpf_probe_read_str is captured in the verifier.
      Both helpers return a negative error code or
      a length, which is equal to or smaller than the buffer
      size argument. This additional information in the
      verifier makes a condition such as "retval > bufsize"
      unnecessary in the bpf program. For example, for the code below,
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0 || usize > max_len)
              return 0;
      The verifier may have the following errors:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (bf) r8 = r0
          54: (bf) r1 = r8
          55: (67) r1 <<= 32
          56: (bf) r2 = r1
          57: (77) r2 >>= 32
          58: (25) if r2 > 0x31f goto pc+33
           R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                               umax_value=18446744069414584320,
                               var_off=(0x0; 0xffffffff00000000))
           R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0) R9=inv800 R10=fp0,call_-1
          59: (1f) r9 -= r8
          60: (c7) r1 s>>= 32
          61: (bf) r2 = r7
          62: (0f) r2 += r1
          math between map_value pointer and register with unbounded
          min value is not allowed
      The failure is due to an llvm compiler optimization where register "r2",
      which is a copy of "r1", is tested for the condition while later on "r1"
      is used for the map_ptr operation. The verifier is not able to track such
      an instruction sequence effectively.
      
      Without the "usize > max_len" condition, there is no llvm optimization
      and the generated code below passes the verifier:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (b7) r1 = 0
          54: (bf) r8 = r0
          55: (67) r8 <<= 32
          56: (c7) r8 s>>= 32
          57: (6d) if r1 s> r8 goto pc+24
           R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
           R1=inv0 R6=ctx(id=0,off=0,imm=0)
           R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
           R10=fp0,call_-1
          58: (bf) r2 = r7
          59: (0f) r2 += r8
          60: (1f) r9 -= r8
          61: (bf) r1 = r6
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      849fa506
    • Y
      bpf: add bpf_get_stack helper · c195651e
      Yonghong Song committed
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, which
      sends stack traces directly to the bpf program. The bpf program
      is able to see all stack traces, and can then do in-kernel
      processing or send stack traces to user space through
      a shared map or bpf_perf_event_output.
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c195651e
  9. 25 Apr, 2018 1 commit
    • P
      bpf: allow map helpers access to map values directly · d71962f3
      Paul Chaignon committed
      Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
      access stack and packet memory.  Allow these helpers to directly access
      map values by passing registers of type PTR_TO_MAP_VALUE.
      
      This change removes the need for an extra copy to the stack when using a
      map value to perform a second map lookup, as in the following:
      
      struct bpf_map_def SEC("maps") infobyreq = {
          .type = BPF_MAP_TYPE_HASH,
          .key_size = sizeof(struct request *),
          .value_size = sizeof(struct info_t),
          .max_entries = 1024,
      };
      struct bpf_map_def SEC("maps") counts = {
          .type = BPF_MAP_TYPE_HASH,
          .key_size = sizeof(struct info_t),
          .value_size = sizeof(u64),
          .max_entries = 1024,
      };
      SEC("kprobe/blk_account_io_start")
      int bpf_blk_account_io_start(struct pt_regs *ctx)
      {
          /* First lookup: the key is the request pointer in ctx->di. */
          struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
          u64 *count;

          if (!info)
              return 0;
          /* Second lookup: the map value is passed directly as the key. */
          count = bpf_map_lookup_elem(&counts, info);
          if (count)
              (*count)++;
          return 0;
      }
      Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d71962f3
  10. 31 Mar, 2018 2 commits
    • A
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov committed
      == The problem ==
      
      There is a use case where all processes inside a cgroup should use one
      single IP address on a host that has multiple IPs configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provide hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • A
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov committed
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
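
      The two checks above can be modelled in a few lines of C. This is
      only a sketch under stated assumptions: every type and function
      name below is a hypothetical stand-in invented for illustration and
      is not the kernel's actual definition or code path.

```c
#include <errno.h>

/* Hypothetical stand-ins for prog types and attach types. */
enum prog_type { PROG_TYPE_PLAIN, PROG_TYPE_SOCK_ADDR };
enum attach_type { ATTACH_UNSPEC, ATTACH_INET4_BIND, ATTACH_INET6_BIND };

struct prog {
	enum prog_type type;
	enum attach_type expected_attach_type;
};

/* Assumption: only the sock_addr-like type needs a known attach type. */
int prog_type_needs_attach_type(enum prog_type type)
{
	return type == PROG_TYPE_SOCK_ADDR;
}

/* Check 1, at load time: reject a missing or invalid expected type. */
int check_load(const struct prog *p)
{
	if (!prog_type_needs_attach_type(p->type))
		return 0;
	switch (p->expected_attach_type) {
	case ATTACH_INET4_BIND:
	case ATTACH_INET6_BIND:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Check 2, at attach time: the requested type must match the expected one. */
int check_attach(const struct prog *p, enum attach_type attach_type)
{
	if (!prog_type_needs_attach_type(p->type))
		return 0;
	return p->expected_attach_type == attach_type ? 0 : -EINVAL;
}
```

      In this model a program loaded with ATTACH_INET4_BIND passes the
      load check but is refused when attached as ATTACH_INET6_BIND.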
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 26 Mar, 2018 2 commits
  12. 24 Mar, 2018 1 commit
    • J
      bpf: Remove struct bpf_verifier_env argument from print_bpf_insn · abe08840
      Jiri Olsa committed
      We use print_bpf_insn in user space (bpftool and soon perf),
      so it'd be nice to keep it generic and strip it off the kernel
      struct bpf_verifier_env argument.
      
      This argument can be safely removed, because its users can
      use the struct bpf_insn_cbs::private_data to pass it.
      
      By changing the argument type we can no longer have a clean
      'verbose' alias to 'bpf_verifier_log_write' in verifier.c.
      Instead we're adding the 'verbose' cb_print callback and
      removing the alias.
      
      This way we have the new cb_print callback in place, and all
      the 'verbose(env, ...)' calls in verifier.c will cleanly
      cast to 'verbose(void *, ...)' so no other change is
      needed.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  13. 20 Mar, 2018 1 commit
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend committed
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similar to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  14. 24 Feb, 2018 1 commit
    • D
      bpf: allow xadd only on aligned memory · ca369602
      Daniel Borkmann committed
      The requirements around atomic_add() / atomic64_add() resp. their
      JIT implementations differ across architectures. E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers via interpreter but also JIT the following crash:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      Reason for this is because memory is required to be aligned. In
      case of BPF, we always enforce alignment in terms of stack access,
      but not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data that is local to us anyway is just wrong, so
      forbid this case entirely. The only place where xadd makes sense in
      fact are map values; xadd on stack is wrong as well, but it's been
      around for much longer. Specifically enforce strict alignment in case
      of xadd, so that we handle this case generically and avoid such crashes
      in the first place.
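
      As a rough illustration of what strict alignment means here, the
      sketch below tests whether an address is a multiple of the access
      size. The function name is hypothetical and this is not the
      verifier's code; it only mirrors the rule that a 4- or 8-byte xadd
      target must be naturally aligned.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if addr is a multiple of size.
 * size is the access width in bytes and must be a power of two
 * (4 or 8 for xadd).
 */
int xadd_addr_is_aligned(uint64_t addr, uint32_t size)
{
	return (addr & ((uint64_t)size - 1)) == 0;
}
```

      Applied to the faulting address from the oops above,
      0xffff8097d7ed6703, the check fails for a 4-byte access because the
      low two bits are 0x3.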
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  15. 15 Feb, 2018 1 commit
  16. 27 Jan, 2018 3 commits
    • D
      bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Daniel Borkmann committed
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
      adopted from cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
      BPF to BPF calls (and generally it's also discouraged to use in
      native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
      the only last missing case affected and to fix is the div/mod
      handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
      though (kgdb used to implement it in kernel long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
      we need to disable preemption for every BPF program types).
      All in all this approach would add a lot of complexity.
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
      Yet another option is to teach verifier that div/mod can
      return an integer, which however is also complex to implement
      as verifier would need to walk such fake 'mov r0,<code>; exit;'
      sequence and there would still be no guarantee for having
      propagation of this further down to the BPF caller as proper
      exception code. For parent prog, it is also not distinguishable
      from a normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
      behavior. Instead of bailing out, we adopt the same behavior
      as on some major archs like ARMv8 [0] into eBPF as well:
      X div 0 results in 0, and X mod 0 results in X. aarch64 and
      aarch32 ISA do not generate any traps or otherwise aborts
      of program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
      eBPF verifier to emit such rewrites once div/mod by register
      was seen. cBPF is not touched and will keep existing 'return 0'
      semantics. Given the options, it seems the most suitable from
      all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
      from LLVM code generation side (which is then fully visible
      to verifier). Thus, this patch i) fixes the panic seen in
      above program and ii) doesn't bypass the verifier observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
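
      The semantics chosen above (X div 0 results in 0, X mod 0 results
      in X) can be written down as two small C functions. These are
      hypothetical models for illustration only, not the rewrites the
      verifier emits and not kernel functions.

```c
#include <stdint.h>

/* Model of unsigned 64-bit division: X div 0 results in 0. */
uint64_t bpf_div64_model(uint64_t dst, uint64_t src)
{
	return src ? dst / src : 0;
}

/* Model of unsigned 64-bit modulo: X mod 0 results in X. */
uint64_t bpf_mod64_model(uint64_t dst, uint64_t src)
{
	return src ? dst % src : dst;
}
```

      With these definitions the program keeps running after a zero
      divisor, so there is no hidden early return for the verifier to
      miss.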
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • D
      bpf: make unknown opcode handling more robust · 5e581dad
      Daniel Borkmann committed
      Recent findings by syzkaller fixed in 7891a87e ("bpf: arsh is
      not supported in 32 bit alu thus reject it") triggered a warning
      in the interpreter due to unknown opcode not being rejected by
      the verifier. The 'return 0' for an unknown opcode is really not
      optimal, since with BPF to BPF calls, this would go untracked by
      the verifier.
      
      Do two things here to improve the situation: i) perform basic insn
      sanity check early on in the verification phase and reject every
      non-uapi insn right there. The bpf_opcode_in_insntable() table
      reuses the same mapping as the jumptable in ___bpf_prog_run() sans
      the non-public mappings. And ii) in ___bpf_prog_run() we do need
      to BUG in the case where the verifier would ever create an unknown
      opcode due to some rewrites.
      
      Note that JITs do not have such issues since they would punt to
      interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
      would also help to avoid such unknown opcodes in the first place.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • D
      bpf: improve dead code sanitizing · 2a5418a1
      Daniel Borkmann committed
      Given we recently had c131187d ("bpf: fix branch pruning
      logic") and 95a762e2 ("bpf: fix incorrect sign extension in
      check_alu_op()") in particular where before verifier skipped
      verification of the wrongly assumed dead branch, we should not
      just replace the dead code parts with nops (mov r0,r0). If there
      is a bug such as fixed in 95a762e2 in future again, where
      runtime could execute those insns, then one of the potential
      issues with the current setting would be that given the nops
      would be at the end of the program, we could execute out of
      bounds at some point.
      
      The best in such case would be to just exit the BPF program
      altogether and return an exception code. However, given this
      would require two instructions, and such a dead code gap could
      just be a single insn long, we would need to place 'r0 = X; ret'
      snippet at the very end after the user program or at the start
      before the program (where we'd skip that region on prog entry),
      and then place unconditional ja's into the dead code gap.
      
      While more complex but possible, there's still another roadblock
      that currently prevents this, namely BPF to
      BPF calls. The issue here is that such exception could be
      returned from a callee, but the caller would not know that
      it's an exception that needs to be propagated further down.
      Alternative that has little complexity is to just use a ja-1
      code for now which will trap the execution here instead of
      silently doing bad things if we ever get there due to bugs.
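
      To see why 'ja -1' traps, recall that an eBPF jump lands at
      pc + 1 + off, so an offset of -1 jumps back to the same insn. The
      sketch below uses a simplified local copy of the instruction layout
      (register fields merged into one byte); the struct, macro and
      function names are assumptions for illustration, not kernel
      definitions.

```c
#include <stdint.h>

/* Simplified local copy of the eBPF instruction layout. */
struct insn {
	uint8_t code;	/* opcode */
	uint8_t regs;	/* dst/src registers, unused here */
	int16_t off;	/* signed jump offset */
	int32_t imm;	/* immediate, unused here */
};

#define CLS_JMP 0x05	/* jump instruction class */
#define OP_JA   0x00	/* unconditional jump */

/* Build the trapping insn: an unconditional jump with offset -1. */
struct insn make_trap_insn(void)
{
	struct insn i = { .code = CLS_JMP | OP_JA, .off = -1 };
	return i;
}

/* Jump targets are relative to the next insn: pc + 1 + off. */
int jump_target(int pc, const struct insn *i)
{
	return pc + 1 + i->off;
}
```

      For a trap insn at index 10, jump_target() yields 10 again, so
      execution spins on that insn instead of running past the end of the
      program.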
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  17. 20 Jan, 2018 2 commits
  18. 18 Jan, 2018 1 commit
    • D
      bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Daniel Borkmann committed
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max value
      are both 0 due to the immediate assignment, later in the jsle
      test the bounds are updated for the min value in the false
      path, meaning, they yield smin_val = 1, smax_val = 0, and when
      ctx pointer is subtracted from r0, verifier bails out with the
      internal error and throwing a WARN since smin_val != smax_val
      for the known constant.
      
      For min_val > max_val scenario it means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds)
      demonstrated that such branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
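
      The graceful handling described above can be sketched as follows.
      This is a simplified model that only covers the min > max case; the
      struct and function names are hypothetical and do not match the
      verifier's register state or its adjust_*_min_max_vals() functions.

```c
#include <stdint.h>

/* Hypothetical, reduced register state: signed and unsigned bounds only. */
struct reg_model {
	int64_t smin, smax;
	uint64_t umin, umax;
};

/* Bounds proving that a branch cannot be taken: min exceeds max. */
int bounds_inconsistent(const struct reg_model *r)
{
	return r->smin > r->smax || r->umin > r->umax;
}

/* Forget everything known about the register. */
void mark_unknown(struct reg_model *r)
{
	r->smin = INT64_MIN;
	r->smax = INT64_MAX;
	r->umin = 0;
	r->umax = UINT64_MAX;
}

/*
 * Instead of rejecting the program or warning, taint dst as an
 * unknown scalar when the src bounds are inconsistent.
 * Returns 1 if dst was tainted, 0 otherwise.
 */
int adjust_model(struct reg_model *dst, const struct reg_model *src)
{
	if (bounds_inconsistent(src)) {
		mark_unknown(dst);
		return 1;
	}
	return 0;
}
```

      Feeding in the bounds from the example above (smin_val = 1,
      smax_val = 0) taints the destination, while consistent [0, 0]
      bounds leave it untouched.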
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  19. 17 Jan, 2018 1 commit
  20. 15 Jan, 2018 2 commits
    • J
      bpf: offload: add map offload infrastructure · a3884572
      Jakub Kicinski committed
      BPF map offload follows a similar path to program offload.  At creation
      time users may specify ifindex of the device on which they want to
      create the map.  Map will be validated by the kernel's
      .map_alloc_check callback and device driver will be called for the
      actual allocation.  Map will have an empty set of operations
      associated with it (save for alloc and free callbacks).  The real
      device callbacks are kept in map->offload->dev_ops because they
      have slightly different signatures.  Map operations are called in
      process context so the driver may communicate with HW freely,
      msleep(), wait() etc.
      
      Map alloc and free callbacks are muxed via existing .ndo_bpf, and
      are always called with rtnl lock held.  Maps and programs are
      guaranteed to be destroyed before .ndo_uninit (i.e. before
      unregister_netdev() returns).  Map callbacks are invoked with
      bpf_devs_lock *read* locked, drivers must take care of exclusive
      locking if necessary.
      
      All offload-specific branches are marked with unlikely() (through
      bpf_map_is_dev_bound()), given that branch penalty will be
      negligible compared to IO anyway, and we don't want to penalize
      SW path unnecessarily.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • A
      bpf: fix 32-bit divide by zero · 68fda450
      Alexei Starovoitov committed
      Due to some JITs doing the if (src_reg == 0) check in 64-bit mode
      for div/mod operations, mask the upper 32 bits of the src register
      before doing the check.
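
      A small C model shows why the 64-bit comparison is not enough for a
      32-bit div/mod. The function names are hypothetical and the code is
      only a sketch of the idea, not JIT output: a source register whose
      low 32 bits are zero but whose upper bits are set slips past a
      64-bit zero test, while the 32-bit operation still divides by zero.

```c
#include <stdint.h>

/* Buggy variant: tests all 64 bits of the source register. */
int src_is_zero_64bit_check(uint64_t src)
{
	return src == 0;
}

/* Fixed variant: drop the upper 32 bits before testing. */
int src_is_zero_32bit_check(uint64_t src)
{
	return (uint32_t)src == 0;
}
```

      For src = 0x100000000 the first check reports a non-zero divisor
      and the second correctly reports zero.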
      
      Fixes: 62258278 ("net: filter: x86: internal BPF JIT")
      Fixes: 7a12b503 ("sparc64: Add eBPF JIT.")
      Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  21. 11 Jan, 2018 2 commits