1. 20 3月, 2018 1 次提交
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend 提交于
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
      SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACESS to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similar to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4f738adb
  2. 24 2月, 2018 1 次提交
    • D
      bpf: allow xadd only on aligned memory · ca369602
      Daniel Borkmann 提交于
      The requirements around atomic_add() / atomic64_add() resp. their
      JIT implementations differ across architectures. E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers via interpreter but also JIT the following crash:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      Reason for this is because memory is required to be aligned. In
      case of BPF, we always enforce alignment in terms of stack access,
      but not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data that is local to us anyway is just wrong, so
      forbid this case entirely. The only place where xadd makes sense in
      fact are map values; xadd on stack is wrong as well, but it's been
      around for much longer. Specifically enforce strict alignment in case
      of xadd, so that we handle this case generically and avoid such crashes
      in the first place.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      ca369602
  3. 15 2月, 2018 1 次提交
  4. 27 1月, 2018 3 次提交
    • D
      bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Daniel Borkmann 提交于
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
      adopted from cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
      BPF to BPF calls (and generally it's also disadvised to use in
      native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
      the only last missing case affected and to fix is the div/mod
      handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
      though (kgdb used to implement it in kernel long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
      we need to disable preemption for every BPF program types).
      All in all this approach would add a lot of complexity.
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
      Yet another option is to teach verifier that div/mod can
      return an integer, which however is also complex to implement
      as verifier would need to walk such fake 'mov r0,<code>; exit;'
      sequeuence and there would still be no guarantee for having
      propagation of this further down to the BPF caller as proper
      exception code. For parent prog, it is also is not distinguishable
      from a normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
      behavior. Instead of bailing out, we adapt the same behavior
      as on some major archs like ARMv8 [0] into eBPF as well:
      X div 0 results in 0, and X mod 0 results in X. aarch64 and
      aarch32 ISA do not generate any traps or otherwise aborts
      of program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
      eBPF verifier to emit such rewrites once div/mod by register
      was seen. cBPF is not touched and will keep existing 'return 0'
      semantics. Given the options, it seems the most suitable from
      all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
      from LLVM code generation side (which is then fully visible
      to verifier). Thus, this patch i) fixes the panic seen in
      above program and ii) doesn't bypass the verifier observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6b1b3bf
    • D
      bpf: make unknown opcode handling more robust · 5e581dad
      Daniel Borkmann 提交于
      Recent findings by syzcaller fixed in 7891a87e ("bpf: arsh is
      not supported in 32 bit alu thus reject it") triggered a warning
      in the interpreter due to unknown opcode not being rejected by
      the verifier. The 'return 0' for an unknown opcode is really not
      optimal, since with BPF to BPF calls, this would go untracked by
      the verifier.
      
      Do two things here to improve the situation: i) perform basic insn
      sanity check early on in the verification phase and reject every
      non-uapi insn right there. The bpf_opcode_in_insntable() table
      reuses the same mapping as the jumptable in ___bpf_prog_run() sans
      the non-public mappings. And ii) in ___bpf_prog_run() we do need
      to BUG in the case where the verifier would ever create an unknown
      opcode due to some rewrites.
      
      Note that JITs do not have such issues since they would punt to
      interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
      would also help to avoid such unknown opcodes in the first place.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      5e581dad
    • D
      bpf: improve dead code sanitizing · 2a5418a1
      Daniel Borkmann 提交于
      Given we recently had c131187d ("bpf: fix branch pruning
      logic") and 95a762e2 ("bpf: fix incorrect sign extension in
      check_alu_op()") in particular where before verifier skipped
      verification of the wrongly assumed dead branch, we should not
      just replace the dead code parts with nops (mov r0,r0). If there
      is a bug such as fixed in 95a762e2 in future again, where
      runtime could execute those insns, then one of the potential
      issues with the current setting would be that given the nops
      would be at the end of the program, we could execute out of
      bounds at some point.
      
      The best in such case would be to just exit the BPF program
      altogether and return an exception code. However, given this
      would require two instructions, and such a dead code gap could
      just be a single insn long, we would need to place 'r0 = X; ret'
      snippet at the very end after the user program or at the start
      before the program (where we'd skip that region on prog entry),
      and then place unconditional ja's into the dead code gap.
      
      While more complex but possible, there's still another block
      in the road that currently prevents from this, namely BPF to
      BPF calls. The issue here is that such exception could be
      returned from a callee, but the caller would not know that
      it's an exception that needs to be propagated further down.
      Alternative that has little complexity is to just use a ja-1
      code for now which will trap the execution here instead of
      silently doing bad things if we ever get there due to bugs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2a5418a1
  5. 20 1月, 2018 2 次提交
  6. 18 1月, 2018 1 次提交
    • D
      bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Daniel Borkmann 提交于
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max value
      are both 0 due to the immediate assignment, later in the jsle
      test the bounds are updated for the min value in the false
      path, meaning, they yield smin_val = 1, smax_val = 0, and when
      ctx pointer is subtracted from r0, verifier bails out with the
      internal error and throwing a WARN since smin_val != smax_val
      for the known constant.
      
      For min_val > max_val scenario it means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds)
      demonstrated that such branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      6f16101e
  7. 17 1月, 2018 1 次提交
  8. 15 1月, 2018 2 次提交
    • J
      bpf: offload: add map offload infrastructure · a3884572
      Jakub Kicinski 提交于
      BPF map offload follow similar path to program offload.  At creation
      time users may specify ifindex of the device on which they want to
      create the map.  Map will be validated by the kernel's
      .map_alloc_check callback and device driver will be called for the
      actual allocation.  Map will have an empty set of operations
      associated with it (save for alloc and free callbacks).  The real
      device callbacks are kept in map->offload->dev_ops because they
      have slightly different signatures.  Map operations are called in
      process context so the driver may communicate with HW freely,
      msleep(), wait() etc.
      
      Map alloc and free callbacks are muxed via existing .ndo_bpf, and
      are always called with rtnl lock held.  Maps and programs are
      guaranteed to be destroyed before .ndo_uninit (i.e. before
      unregister_netdev() returns).  Map callbacks are invoked with
      bpf_devs_lock *read* locked, drivers must take care of exclusive
      locking if necessary.
      
      All offload-specific branches are marked with unlikely() (through
      bpf_map_is_dev_bound()), given that branch penalty will be
      negligible compared to IO anyway, and we don't want to penalize
      SW path unnecessarily.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a3884572
    • A
      bpf: fix 32-bit divide by zero · 68fda450
      Alexei Starovoitov 提交于
      due to some JITs doing if (src_reg == 0) check in 64-bit mode
      for div/mod operations mask upper 32-bits of src register
      before doing the check
      
      Fixes: 62258278 ("net: filter: x86: internal BPF JIT")
      Fixes: 7a12b503 ("sparc64: Add eBPF JIT.")
      Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      68fda450
  9. 11 1月, 2018 2 次提交
  10. 10 1月, 2018 1 次提交
    • Q
      bpf: export function to write into verifier log buffer · 430e68d1
      Quentin Monnet 提交于
      Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
      export it, so that other components (in particular, drivers for BPF
      offload) can reuse the user buffer log to dump error messages at
      verification time.
      
      Renaming `verbose()` was necessary in order to avoid a name so generic
      to be exported to the global namespace. However to prevent too much pain
      for backports, the calls to `verbose()` in the kernel BPF verifier were
      not changed. Instead, use function aliasing to make `verbose` point to
      `bpf_verifier_log_write`. Another solution could consist in making a
      wrapper around `verbose()`, but since it is a variadic function, I don't
      see a clean way without creating two identical wrappers, one for the
      verifier and one to export.
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      430e68d1
  11. 09 1月, 2018 2 次提交
    • A
      bpf: prevent out-of-bounds speculation · b2157399
      Alexei Starovoitov 提交于
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data round up array-based maps and mask the index
      after bounds check, so speculated load with out of bounds index will load
      either valid value from the array or zero from the padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is secondary redundant 'if (index >= max_entries)' in the interpreter
      and in all JITs, but they can be optimized later if necessary.
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that attack potentially can be crafted via syscall commands
      without loading the program, so add masking to those paths as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b2157399
    • A
      bpf: fix verifier GPF in kmalloc failure path · 5896351e
      Alexei Starovoitov 提交于
      syzbot reported the following panic in the verifier triggered
      by kmalloc error injection:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
      RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
      Call Trace:
       pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
       push_stack kernel/bpf/verifier.c:491 [inline]
       check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
       do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
       bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
       bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
       SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
       SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769
      
      when copy_verifier_state() aborts in the middle due to kmalloc failure
      some of the frames could have been partially copied while
      current free_verifier_state() loop
      for (i = 0; i <= state->curframe; i++)
      assumed that all frames are non-null.
      Simply fix it by adding 'if (!state)' to free_func_state().
      Also avoid stressing copy frame logic more if kzalloc fails
      in push_stack() free env->cur_state right away.
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com
      Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5896351e
  12. 31 12月, 2017 1 次提交
  13. 28 12月, 2017 2 次提交
  14. 24 12月, 2017 1 次提交
    • G
      bpf: fix stacksafe exploration when comparing states · fd05e57b
      Gianluca Borello 提交于
      Commit cc2b14d5 ("bpf: teach verifier to recognize zero initialized
      stack") introduced a very relaxed check when comparing stacks of different
      states, effectively returning a positive result in many cases where it
      shouldn't.
      
      This can create problems in cases such as this following C pseudocode:
      
      long var;
      long *x = bpf_map_lookup(...);
      if (!x)
              return;
      
      if (*x != 0xbeef)
              var = 0;
      else
              var = 1;
      
      /* This is the key part, calling a helper causes an explored state
       * to be saved with the information that "var" is on the stack as
       * STACK_ZERO, since the helper is first met by the verifier after
       * the "var = 0" assignment. This state will however be wrongly used
       * also for the "var = 1" case, so the verifier assumes "var" is always
       * 0 and will replace the NULL assignment with nops, because the
       * search pruning prevents it from exploring the faulty branch.
       */
      bpf_ktime_get_ns();
      
      if (var)
              *(long *)0 = 0xbeef;
      
      Fix the issue by making sure that the stack is fully explored before
      returning a positive comparison result.
      
      Also attach a couple tests that highlight the bad behavior. In the first
      test, without this fix instructions 16 and 17 are replaced with nops
      instead of being rejected by the verifier.
      
      The second test, instead, allows a program to make a potentially illegal
      read from the stack.
      
      Fixes: cc2b14d5 ("bpf: teach verifier to recognize zero initialized stack")
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fd05e57b
  15. 21 12月, 2017 11 次提交
    • D
      bpf: allow for correlation of maps and helpers in dump · 7105e828
      Daniel Borkmann 提交于
      Currently a dump of an xlated prog (post verifier stage) doesn't
      correlate used helpers as well as maps. The prog info lists
      involved map ids, however there's no correlation of where in the
      program they are used as of today. Likewise, bpftool does not
      correlate helper calls with the target functions.
      
      The latter can be done w/o any kernel changes through kallsyms,
      and also has the advantage that this works with inlined helpers
      and BPF calls.
      
      Example, via interpreter:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                            direct-action not_in_hw id 1 tag c74773051b364165   <-- prog id:1
      
        * Output before patch (calls/maps remain unclear):
      
        # bpftool prog dump xlated id 1             <-- dump prog id:1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = 0xffff95c47a8d4800
         6: (85) call unknown#73040
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call unknown#73040
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        * Output after patch:
      
        # bpftool prog dump xlated id 1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                     <-- map id:2
         6: (85) call bpf_map_lookup_elem#73424     <-- helper call
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call bpf_map_lookup_elem#73424
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        # bpftool map show id 2                     <-- show/dump/etc map id:2
        2: hash_of_maps  flags 0x0
              key 4B  value 4B  max_entries 3  memlock 4096B
      
      Example, JITed, same prog:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                        direct-action not_in_hw id 3 tag c74773051b364165 jited
      
        # bpftool prog show id 3
        3: sched_cls  tag c74773051b364165
              loaded_at Dec 19/13:48  uid 0
              xlated 384B  jited 257B  memlock 4096B  map_ids 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                      <-- map id:2
         6: (85) call __htab_map_lookup_elem#77408   <-+ inlined rewrite
         7: (15) if r0 == 0x0 goto pc+2                |
         8: (07) r0 += 56                              |
         9: (79) r0 = *(u64 *)(r0 +0)                <-+
        10: (15) if r0 == 0x0 goto pc+24
        11: (bf) r2 = r10
        12: (07) r2 += -4
        [...]
      
      Example, same prog, but kallsyms disabled (in that case we are
      also not allowed to pass any relative offsets, etc, so prog
      becomes pointer sanitized on dump):
      
        # sysctl kernel.kptr_restrict=2
        kernel.kptr_restrict = 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]
         6: (85) call bpf_unspec#0
         7: (15) if r0 == 0x0 goto pc+2
        [...]
      
      Example, BPF calls via interpreter:
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#__bpf_prog_run_args32
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      Example, BPF calls via JIT:
      
        # sysctl net.core.bpf_jit_enable=1
        net.core.bpf_jit_enable = 1
        # sysctl net.core.bpf_jit_kallsyms=1
        net.core.bpf_jit_kallsyms = 1
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      And finally, an example for tail calls that is now working
      as well wrt correlation:
      
        # bpftool prog dump xlated id 2
        [...]
        10: (b7) r2 = 8
        11: (85) call bpf_trace_printk#-41312
        12: (bf) r1 = r6
        13: (18) r2 = map[id:1]
        15: (b7) r3 = 0
        16: (85) call bpf_tail_call#12
        17: (b7) r1 = 42
        18: (6b) *(u16 *)(r6 +46) = r1
        19: (b7) r0 = 0
        20: (95) exit
      
        # bpftool map show id 1
        1: prog_array  flags 0x0
              key 4B  value 4B  max_entries 1  memlock 4096B
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      7105e828
    • D
      bpf: fix kallsyms handling for subprogs · 4f74d809
      Daniel Borkmann 提交于
      Right now kallsyms handling is not working with JITed subprogs.
      The reason is that when in 1c2a088a ("bpf: x64: add JIT support
      for multi-function programs") in jit_subprogs() they are passed
      to bpf_prog_kallsyms_add(), then their prog type is 0, which BPF
      core will think it's a cBPF program as only cBPF programs have a
      0 type. Thus, they need to inherit the type from the main prog.
      
      Once that is fixed, they are indeed added to the BPF kallsyms
      infra, but their tag is 0. Therefore, since intention is to add
      them as bpf_prog_F_<tag>, we need to pass them to bpf_prog_calc_tag()
      first. And once this is resolved, there is a use-after-free on
      prog cleanup: we remove the kallsyms entry from the main prog,
      later walk all subprogs and call bpf_jit_free() on them. However,
      the kallsyms linkage was never released on them. Thus, do that
      for all subprogs right in __bpf_prog_put() when refcount hits 0.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4f74d809
    • A
      bpf: do not allow root to mangle valid pointers · 82abbf8d
      Alexei Starovoitov 提交于
      Do not allow root to convert valid pointers into unknown scalars.
      In particular disallow:
       ptr &= reg
       ptr <<= reg
       ptr += ptr
      and explicitly allow:
       ptr -= ptr
      since pkt_end - pkt == length
      
      1.
      This minimizes amount of address leaks root can do.
      In the future may need to further tighten the leaks with kptr_restrict.
      
      2.
      If program has such pointer math it's likely a user mistake and
      when verifier complains about it right away instead of many instructions
      later on invalid memory access it's easier for users to fix their progs.
      
      3.
      when register holding a pointer cannot change to scalar it allows JITs to
      optimize better. Like 32-bit archs could use single register for pointers
      instead of a pair required to hold 64-bit scalars.
      
      4.
      reduces architecture dependent behavior. Since code:
      r1 = r10;
      r1 &= 0xff;
      if (r1 ...)
      will behave differently arm64 vs x64 and offloaded vs native.
      
      A significant chunk of ptr mangling was allowed by
      commit f1174f77 ("bpf/verifier: rework value tracking")
      yet some of it was allowed even earlier.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      82abbf8d
    • A
      bpf: fix integer overflows · bb7f0f98
      Alexei Starovoitov 提交于
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      bb7f0f98
    • J
      bpf: don't prune branches when a scalar is replaced with a pointer · 179d1c56
      Jann Horn 提交于
      This could be made safe by passing through a reference to env and checking
      for env->allow_ptr_leaks, but it would only work one way and is probably
      not worth the hassle - not doing it will not directly lead to program
      rejection.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      179d1c56
    • J
      bpf: force strict alignment checks for stack pointers · a5ec6ae1
      Jann Horn 提交于
      Force strict alignment checks for stack pointers because the tracking of
      stack spills relies on it; unaligned stack accesses can lead to corruption
      of spilled registers, which is exploitable.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a5ec6ae1
    • J
      bpf: fix missing error return in check_stack_boundary() · ea25f914
      Jann Horn 提交于
      Prevent indirect stack accesses at non-constant addresses, which would
      permit reading and corrupting spilled pointers.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ea25f914
    • J
      bpf: fix 32-bit ALU op verification · 468f6eaf
      Jann Horn 提交于
      32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
      Adjust the verifier accordingly.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      468f6eaf
    • J
      bpf: fix incorrect tracking of register size truncation · 0c17d1d2
      Jann Horn 提交于
      Properly handle register truncation to a smaller size.
      
      The old code first mirrors the clearing of the high 32 bits in the bitwise
      tristate representation, which is correct. But then, it computes the new
      arithmetic bounds as the intersection between the old arithmetic bounds and
      the bounds resulting from the bitwise tristate representation. Therefore,
      when coerce_reg_to_32() is called on a number with bounds
      [0xffff'fff8, 0x1'0000'0007], the verifier computes
      [0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
      This is incorrect: The truncated number could also be in the range [0, 7],
      and no meaningful arithmetic bounds can be computed in that case apart from
      the obvious [0, 0xffff'ffff].
      
      Starting with v4.14, this is exploitable by unprivileged users as long as
      the unprivileged_bpf_disabled sysctl isn't set.
      
      Debian assigned CVE-2017-16996 for this issue.
      
      v2:
       - flip the mask during arithmetic bounds calculation (Ben Hutchings)
      v3:
       - add CVE number (Ben Hutchings)
      
      Fixes: b03c9f9f ("bpf/verifier: track signed and unsigned min/max values")
      Signed-off-by: NJann Horn <jannh@google.com>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0c17d1d2
    • J
      bpf: fix incorrect sign extension in check_alu_op() · 95a762e2
      Jann Horn 提交于
      Distinguish between
      BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
      and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
      only perform sign extension in the first case.
      
      Starting with v4.14, this is exploitable by unprivileged users as long as
      the unprivileged_bpf_disabled sysctl isn't set.
      
      Debian assigned CVE-2017-16995 for this issue.
      
      v3:
       - add CVE number (Ben Hutchings)
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: NJann Horn <jannh@google.com>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      95a762e2
    • E
      bpf/verifier: fix bounds calculation on BPF_RSH · 4374f256
      Edward Cree 提交于
      Incorrect signed bounds were being computed.
      If the old upper signed bound was positive and the old lower signed bound was
      negative, this could cause the new upper signed bound to be too low,
      leading to security issues.
      
      Fixes: b03c9f9f ("bpf/verifier: track signed and unsigned min/max values")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      [jannh@google.com: changed description to reflect bug impact]
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4374f256
  16. 19 12月, 2017 2 次提交
  17. 18 12月, 2017 6 次提交
    • A
      bpf: x64: add JIT support for multi-function programs · 1c2a088a
      Alexei Starovoitov 提交于
      Typical JIT does several passes over bpf instructions to
      compute total size and relative offsets of jumps and calls.
      With multitple bpf functions calling each other all relative calls
      will have invalid offsets intially therefore we need to additional
      last pass over the program to emit calls with correct offsets.
      For example in case of three bpf functions:
      main:
        call foo
        call bpf_map_lookup
        exit
      foo:
        call bar
        exit
      bar:
        exit
      
      We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
      x64 JIT typically does 4-5 passes to converge.
      After these initial passes the image for these 3 functions
      will be good except call targets, since start addresses of
      foo() and bar() are unknown when we were JITing main()
      (note that call bpf_map_lookup will be resolved properly
      during initial passes).
      Once start addresses of 3 functions are known we patch
      call_insn->imm to point to right functions and call
      bpf_int_jit_compile() again which needs only one pass.
      Additional safety checks are done to make sure this
      last pass doesn't produce image that is larger or smaller
      than previous pass.
      
      When constant blinding is on it's applied to all functions
      at the first pass, since doing it once again at the last
      pass can change size of the JITed code.
      
      Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
      x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
      All other JITs that support normal BPF_CALL will behave the same way
      since bpf-to-bpf call is equivalent to bpf-to-kernel call from
      JITs point of view.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1c2a088a
    • A
      bpf: fix net.core.bpf_jit_enable race · 60b58afc
      Alexei Starovoitov 提交于
      global bpf_jit_enable variable is tested multiple times in JITs,
      blinding and verifier core. The malicious root can try to toggle
      it while loading the programs. This race condition was accounted
      for and there should be no issues, but it's safer to avoid
      this race condition.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      60b58afc
    • A
      bpf: add support for bpf_call to interpreter · 1ea47e01
      Alexei Starovoitov 提交于
      though bpf_call is still the same call instruction and
      calling convention 'bpf to bpf' and 'bpf to helper' is the same
      the interpreter has to oparate on 'struct bpf_insn *'.
      To distinguish these two cases add a kernel internal opcode and
      mark call insns with it.
      This opcode is seen by interpreter only. JITs will never see it.
      Also add tiny bit of debug code to aid interpreter debugging.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1ea47e01
    • A
      bpf: teach verifier to recognize zero initialized stack · cc2b14d5
      Alexei Starovoitov 提交于
      programs with function calls are often passing various
      pointers via stack. When all calls are inlined llvm
      flattens stack accesses and optimizes away extra branches.
      When functions are not inlined it becomes the job of
      the verifier to recognize zero initialized stack to avoid
      exploring paths that program will not take.
      The following program would fail otherwise:
      
      ptr = &buffer_on_stack;
      *ptr = 0;
      ...
      func_call(.., ptr, ...) {
        if (..)
          *ptr = bpf_map_lookup();
      }
      ...
      if (*ptr != 0) {
        // Access (*ptr)->field is valid.
        // Without stack_zero tracking such (*ptr)->field access
        // will be rejected
      }
      
      since stack slots are no longer uniform invalid | spill | misc
      add liveness marking to all slots, but do it in 8 byte chunks.
      So if nothing was read or written in [fp-16, fp-9] range
      it will be marked as LIVE_NONE.
      If any byte in that range was read, it will be marked LIVE_READ
      and stacksafe() check will perform byte-by-byte verification.
      If all bytes in the range were written the slot will be
      marked as LIVE_WRITTEN.
      This significantly speeds up state equality comparison
      and reduces total number of states processed.
      
                          before   after
      bpf_lb-DLB_L3.o       2051    2003
      bpf_lb-DLB_L4.o       3287    3164
      bpf_lb-DUNKNOWN.o     1080    1080
      bpf_lxc-DDROP_ALL.o   24980   12361
      bpf_lxc-DUNKNOWN.o    34308   16605
      bpf_netdev.o          15404   10962
      bpf_overlay.o         7191    6679
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc2b14d5
    • A
      bpf: introduce function calls (verification) · f4d7e40a
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      To recognize such set of bpf functions the verifier does:
      1. runs control flow analysis to detect function boundaries
      2. proceeds with verification of all functions starting from main(root) function
      It recognizes that the stack of the caller can be accessed by the callee
      (if the caller passed a pointer to its stack to the callee) and the callee
      can store map_value and other pointers into the stack of the caller.
      3. keeps track of the stack_depth of each function to make sure that total
      stack depth is still less than 512 bytes
      4. disallows pointers to the callee stack to be stored into the caller stack,
      since they will be invalid as soon as the callee returns
      5. to reuse all of the existing state_pruning logic each function call
      is considered to be independent call from the verifier point of view.
      The verifier pretends to inline all function calls it sees are being called.
      It stores the callsite instruction index as part of the state to make sure
      that two calls to the same callee from two different places in the caller
      will be different from state pruning point of view
      6. more safety checks are added to liveness analysis
      
      Implementation details:
      . struct bpf_verifier_state is now consists of all stack frames that
        led to this function
      . struct bpf_func_state represent one stack frame. It consists of
        registers in the given frame and its stack
      . propagate_liveness() logic had a premature optimization where
        mark_reg_read() and mark_stack_slot_read() were manually inlined
        with loop iterating over parents for each register or stack slot.
        Undo this optimization to reuse more complex mark_*_read() logic
      . skip_callee() logic is not necessary from safety point of view,
        but without it mark_*_read() markings become too conservative,
        since after returning from the funciton call a read of r6-r9
        will incorrectly propagate the read marks into callee causing
        inefficient pruning later
      . mark_*_read() logic is now aware of control flow which makes it
        more complex. In the future the plan is to rewrite liveness
        to be hierarchical. So that liveness can be done within
        basic block only and control flow will be responsible for
        propagation of liveness information along cfg and between calls.
      . tail_calls and ld_abs insns are not allowed in the programs with
        bpf-to-bpf calls
      . returning stack pointers to the caller or storing them into stack
        frame of the caller is not allowed
      
      Testing:
      . no difference in cilium processed_insn numbers
      . large number of tests follows in next patches
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f4d7e40a
    • A
      bpf: introduce function calls (function boundaries) · cc8b0b92
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      Since the beginning of bpf all bpf programs were represented as a single function
      and program authors were forced to use always_inline for all functions
      in their C code. That was causing llvm to unnecessary inflate the code size
      and forcing developers to move code to header files with little code reuse.
      
      With a bit of additional complexity teach verifier to recognize
      arbitrary function calls from one bpf function to another as long as
      all of functions are presented to the verifier as a single bpf program.
      New program layout:
      r6 = r1    // some code
      ..
      r1 = ..    // arg1
      r2 = ..    // arg2
      call pc+1  // function call pc-relative
      exit
      .. = r1    // access arg1
      .. = r2    // access arg2
      ..
      call pc+20 // second level of function call
      ...
      
      It allows for better optimized code and finally allows to introduce
      the core bpf libraries that can be reused in different projects,
      since programs are no longer limited by single elf file.
      With function calls bpf can be compiled into multiple .o files.
      
      This patch is the first step. It detects programs that contain
      multiple functions and checks that calls between them are valid.
      It splits the sequence of bpf instructions (one program) into a set
      of bpf functions that call each other. Calls to only known
      functions are allowed. In the future the verifier may allow
      calls to unresolved functions and will do dynamic linking.
      This logic supports statically linked bpf functions only.
      
      Such function boundary detection could have been done as part of
      control flow graph building in check_cfg(), but it's cleaner to
      separate function boundary detection vs control flow checks within
      a subprogram (function) into logically indepedent steps.
      Follow up patches may split check_cfg() further, but not check_subprogs().
      
      Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
      These restrictions can be relaxed in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc8b0b92