1. 28 12月, 2017 2 次提交
  2. 24 12月, 2017 1 次提交
    • G
      bpf: fix stacksafe exploration when comparing states · fd05e57b
      Gianluca Borello 提交于
      Commit cc2b14d5 ("bpf: teach verifier to recognize zero initialized
      stack") introduced a very relaxed check when comparing stacks of different
      states, effectively returning a positive result in many cases where it
      shouldn't.
      
      This can create problems in cases such as this following C pseudocode:
      
      long var;
      long *x = bpf_map_lookup(...);
      if (!x)
              return;
      
      if (*x != 0xbeef)
              var = 0;
      else
              var = 1;
      
      /* This is the key part, calling a helper causes an explored state
       * to be saved with the information that "var" is on the stack as
       * STACK_ZERO, since the helper is first met by the verifier after
       * the "var = 0" assignment. This state will however be wrongly used
       * also for the "var = 1" case, so the verifier assumes "var" is always
       * 0 and will replace the NULL assignment with nops, because the
       * search pruning prevents it from exploring the faulty branch.
       */
      bpf_ktime_get_ns();
      
      if (var)
              *(long *)0 = 0xbeef;
      
      Fix the issue by making sure that the stack is fully explored before
      returning a positive comparison result.
      
      Also attach a couple tests that highlight the bad behavior. In the first
      test, without this fix instructions 16 and 17 are replaced with nops
      instead of being rejected by the verifier.
      
      The second test, instead, allows a program to make a potentially illegal
      read from the stack.
      
      Fixes: cc2b14d5 ("bpf: teach verifier to recognize zero initialized stack")
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fd05e57b
  3. 21 12月, 2017 11 次提交
    • D
      bpf: allow for correlation of maps and helpers in dump · 7105e828
      Daniel Borkmann 提交于
      Currently a dump of an xlated prog (post verifier stage) doesn't
      correlate used helpers as well as maps. The prog info lists
      involved map ids, however there's no correlation of where in the
      program they are used as of today. Likewise, bpftool does not
      correlate helper calls with the target functions.
      
      The latter can be done w/o any kernel changes through kallsyms,
      and also has the advantage that this works with inlined helpers
      and BPF calls.
      
      Example, via interpreter:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                            direct-action not_in_hw id 1 tag c74773051b364165   <-- prog id:1
      
        * Output before patch (calls/maps remain unclear):
      
        # bpftool prog dump xlated id 1             <-- dump prog id:1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = 0xffff95c47a8d4800
         6: (85) call unknown#73040
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call unknown#73040
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        * Output after patch:
      
        # bpftool prog dump xlated id 1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                     <-- map id:2
         6: (85) call bpf_map_lookup_elem#73424     <-- helper call
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call bpf_map_lookup_elem#73424
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        # bpftool map show id 2                     <-- show/dump/etc map id:2
        2: hash_of_maps  flags 0x0
              key 4B  value 4B  max_entries 3  memlock 4096B
      
      Example, JITed, same prog:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                        direct-action not_in_hw id 3 tag c74773051b364165 jited
      
        # bpftool prog show id 3
        3: sched_cls  tag c74773051b364165
              loaded_at Dec 19/13:48  uid 0
              xlated 384B  jited 257B  memlock 4096B  map_ids 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                      <-- map id:2
         6: (85) call __htab_map_lookup_elem#77408   <-+ inlined rewrite
         7: (15) if r0 == 0x0 goto pc+2                |
         8: (07) r0 += 56                              |
         9: (79) r0 = *(u64 *)(r0 +0)                <-+
        10: (15) if r0 == 0x0 goto pc+24
        11: (bf) r2 = r10
        12: (07) r2 += -4
        [...]
      
      Example, same prog, but kallsyms disabled (in that case we are
      also not allowed to pass any relative offsets, etc, so prog
      becomes pointer sanitized on dump):
      
        # sysctl kernel.kptr_restrict=2
        kernel.kptr_restrict = 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]
         6: (85) call bpf_unspec#0
         7: (15) if r0 == 0x0 goto pc+2
        [...]
      
      Example, BPF calls via interpreter:
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#__bpf_prog_run_args32
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      Example, BPF calls via JIT:
      
        # sysctl net.core.bpf_jit_enable=1
        net.core.bpf_jit_enable = 1
        # sysctl net.core.bpf_jit_kallsyms=1
        net.core.bpf_jit_kallsyms = 1
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      And finally, an example for tail calls that is now working
      as well wrt correlation:
      
        # bpftool prog dump xlated id 2
        [...]
        10: (b7) r2 = 8
        11: (85) call bpf_trace_printk#-41312
        12: (bf) r1 = r6
        13: (18) r2 = map[id:1]
        15: (b7) r3 = 0
        16: (85) call bpf_tail_call#12
        17: (b7) r1 = 42
        18: (6b) *(u16 *)(r6 +46) = r1
        19: (b7) r0 = 0
        20: (95) exit
      
        # bpftool map show id 1
        1: prog_array  flags 0x0
              key 4B  value 4B  max_entries 1  memlock 4096B
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      7105e828
    • D
      bpf: fix kallsyms handling for subprogs · 4f74d809
      Daniel Borkmann 提交于
      Right now kallsyms handling is not working with JITed subprogs.
      The reason is that when in 1c2a088a ("bpf: x64: add JIT support
      for multi-function programs") in jit_subprogs() they are passed
      to bpf_prog_kallsyms_add(), then their prog type is 0, which BPF
      core will think it's a cBPF program as only cBPF programs have a
      0 type. Thus, they need to inherit the type from the main prog.
      
      Once that is fixed, they are indeed added to the BPF kallsyms
      infra, but their tag is 0. Therefore, since intention is to add
      them as bpf_prog_F_<tag>, we need to pass them to bpf_prog_calc_tag()
      first. And once this is resolved, there is a use-after-free on
      prog cleanup: we remove the kallsyms entry from the main prog,
      later walk all subprogs and call bpf_jit_free() on them. However,
      the kallsyms linkage was never released on them. Thus, do that
      for all subprogs right in __bpf_prog_put() when refcount hits 0.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4f74d809
    • A
      bpf: do not allow root to mangle valid pointers · 82abbf8d
      Alexei Starovoitov 提交于
      Do not allow root to convert valid pointers into unknown scalars.
      In particular disallow:
       ptr &= reg
       ptr <<= reg
       ptr += ptr
      and explicitly allow:
       ptr -= ptr
      since pkt_end - pkt == length
      
      1.
      This minimizes amount of address leaks root can do.
      In the future may need to further tighten the leaks with kptr_restrict.
      
      2.
      If program has such pointer math it's likely a user mistake and
      when verifier complains about it right away instead of many instructions
      later on invalid memory access it's easier for users to fix their progs.
      
      3.
      when register holding a pointer cannot change to scalar it allows JITs to
      optimize better. Like 32-bit archs could use single register for pointers
      instead of a pair required to hold 64-bit scalars.
      
      4.
      reduces architecture dependent behavior. Since code:
      r1 = r10;
      r1 &= 0xff;
      if (r1 ...)
      will behave differently arm64 vs x64 and offloaded vs native.
      
      A significant chunk of ptr mangling was allowed by
      commit f1174f77 ("bpf/verifier: rework value tracking")
      yet some of it was allowed even earlier.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      82abbf8d
    • A
      bpf: fix integer overflows · bb7f0f98
      Alexei Starovoitov 提交于
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      bb7f0f98
    • J
      bpf: don't prune branches when a scalar is replaced with a pointer · 179d1c56
      Jann Horn 提交于
      This could be made safe by passing through a reference to env and checking
      for env->allow_ptr_leaks, but it would only work one way and is probably
      not worth the hassle - not doing it will not directly lead to program
      rejection.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      179d1c56
    • J
      bpf: force strict alignment checks for stack pointers · a5ec6ae1
      Jann Horn 提交于
      Force strict alignment checks for stack pointers because the tracking of
      stack spills relies on it; unaligned stack accesses can lead to corruption
      of spilled registers, which is exploitable.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a5ec6ae1
    • J
      bpf: fix missing error return in check_stack_boundary() · ea25f914
      Jann Horn 提交于
      Prevent indirect stack accesses at non-constant addresses, which would
      permit reading and corrupting spilled pointers.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ea25f914
    • J
      bpf: fix 32-bit ALU op verification · 468f6eaf
      Jann Horn 提交于
      32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
      Adjust the verifier accordingly.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      468f6eaf
    • J
      bpf: fix incorrect tracking of register size truncation · 0c17d1d2
      Jann Horn 提交于
      Properly handle register truncation to a smaller size.
      
      The old code first mirrors the clearing of the high 32 bits in the bitwise
      tristate representation, which is correct. But then, it computes the new
      arithmetic bounds as the intersection between the old arithmetic bounds and
      the bounds resulting from the bitwise tristate representation. Therefore,
      when coerce_reg_to_32() is called on a number with bounds
      [0xffff'fff8, 0x1'0000'0007], the verifier computes
      [0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
      This is incorrect: The truncated number could also be in the range [0, 7],
      and no meaningful arithmetic bounds can be computed in that case apart from
      the obvious [0, 0xffff'ffff].
      
      Starting with v4.14, this is exploitable by unprivileged users as long as
      the unprivileged_bpf_disabled sysctl isn't set.
      
      Debian assigned CVE-2017-16996 for this issue.
      
      v2:
       - flip the mask during arithmetic bounds calculation (Ben Hutchings)
      v3:
       - add CVE number (Ben Hutchings)
      
      Fixes: b03c9f9f ("bpf/verifier: track signed and unsigned min/max values")
      Signed-off-by: NJann Horn <jannh@google.com>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0c17d1d2
    • J
      bpf: fix incorrect sign extension in check_alu_op() · 95a762e2
      Jann Horn 提交于
      Distinguish between
      BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
      and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
      only perform sign extension in the first case.
      
      Starting with v4.14, this is exploitable by unprivileged users as long as
      the unprivileged_bpf_disabled sysctl isn't set.
      
      Debian assigned CVE-2017-16995 for this issue.
      
      v3:
       - add CVE number (Ben Hutchings)
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: NJann Horn <jannh@google.com>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      95a762e2
    • E
      bpf/verifier: fix bounds calculation on BPF_RSH · 4374f256
      Edward Cree 提交于
      Incorrect signed bounds were being computed.
      If the old upper signed bound was positive and the old lower signed bound was
      negative, this could cause the new upper signed bound to be too low,
      leading to security issues.
      
      Fixes: b03c9f9f ("bpf/verifier: track signed and unsigned min/max values")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      [jannh@google.com: changed description to reflect bug impact]
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4374f256
  4. 19 12月, 2017 3 次提交
  5. 18 12月, 2017 7 次提交
    • J
      trace: reenable preemption if we modify the ip · 46df3d20
      Josef Bacik 提交于
      Things got moved around between the original bpf_override_return patches
      and the final version, and now the ftrace kprobe dispatcher assumes if
      you modified the ip that you also enabled preemption.  Make a comment of
      this and enable preemption, this fixes the lockdep splat that happened
      when using this feature.
      
      Fixes: 9802d865 ("bpf: add a bpf_override_function helper")
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      46df3d20
    • A
      bpf: x64: add JIT support for multi-function programs · 1c2a088a
      Alexei Starovoitov 提交于
      Typical JIT does several passes over bpf instructions to
      compute total size and relative offsets of jumps and calls.
      With multitple bpf functions calling each other all relative calls
      will have invalid offsets intially therefore we need to additional
      last pass over the program to emit calls with correct offsets.
      For example in case of three bpf functions:
      main:
        call foo
        call bpf_map_lookup
        exit
      foo:
        call bar
        exit
      bar:
        exit
      
      We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
      x64 JIT typically does 4-5 passes to converge.
      After these initial passes the image for these 3 functions
      will be good except call targets, since start addresses of
      foo() and bar() are unknown when we were JITing main()
      (note that call bpf_map_lookup will be resolved properly
      during initial passes).
      Once start addresses of 3 functions are known we patch
      call_insn->imm to point to right functions and call
      bpf_int_jit_compile() again which needs only one pass.
      Additional safety checks are done to make sure this
      last pass doesn't produce image that is larger or smaller
      than previous pass.
      
      When constant blinding is on it's applied to all functions
      at the first pass, since doing it once again at the last
      pass can change size of the JITed code.
      
      Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
      x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
      All other JITs that support normal BPF_CALL will behave the same way
      since bpf-to-bpf call is equivalent to bpf-to-kernel call from
      JITs point of view.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1c2a088a
    • A
      bpf: fix net.core.bpf_jit_enable race · 60b58afc
      Alexei Starovoitov 提交于
      global bpf_jit_enable variable is tested multiple times in JITs,
      blinding and verifier core. The malicious root can try to toggle
      it while loading the programs. This race condition was accounted
      for and there should be no issues, but it's safer to avoid
      this race condition.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      60b58afc
    • A
      bpf: add support for bpf_call to interpreter · 1ea47e01
      Alexei Starovoitov 提交于
      though bpf_call is still the same call instruction and
      calling convention 'bpf to bpf' and 'bpf to helper' is the same
      the interpreter has to oparate on 'struct bpf_insn *'.
      To distinguish these two cases add a kernel internal opcode and
      mark call insns with it.
      This opcode is seen by interpreter only. JITs will never see it.
      Also add tiny bit of debug code to aid interpreter debugging.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1ea47e01
    • A
      bpf: teach verifier to recognize zero initialized stack · cc2b14d5
      Alexei Starovoitov 提交于
      programs with function calls are often passing various
      pointers via stack. When all calls are inlined llvm
      flattens stack accesses and optimizes away extra branches.
      When functions are not inlined it becomes the job of
      the verifier to recognize zero initialized stack to avoid
      exploring paths that program will not take.
      The following program would fail otherwise:
      
      ptr = &buffer_on_stack;
      *ptr = 0;
      ...
      func_call(.., ptr, ...) {
        if (..)
          *ptr = bpf_map_lookup();
      }
      ...
      if (*ptr != 0) {
        // Access (*ptr)->field is valid.
        // Without stack_zero tracking such (*ptr)->field access
        // will be rejected
      }
      
      since stack slots are no longer uniform invalid | spill | misc
      add liveness marking to all slots, but do it in 8 byte chunks.
      So if nothing was read or written in [fp-16, fp-9] range
      it will be marked as LIVE_NONE.
      If any byte in that range was read, it will be marked LIVE_READ
      and stacksafe() check will perform byte-by-byte verification.
      If all bytes in the range were written the slot will be
      marked as LIVE_WRITTEN.
      This significantly speeds up state equality comparison
      and reduces total number of states processed.
      
                          before   after
      bpf_lb-DLB_L3.o       2051    2003
      bpf_lb-DLB_L4.o       3287    3164
      bpf_lb-DUNKNOWN.o     1080    1080
      bpf_lxc-DDROP_ALL.o   24980   12361
      bpf_lxc-DUNKNOWN.o    34308   16605
      bpf_netdev.o          15404   10962
      bpf_overlay.o         7191    6679
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc2b14d5
    • A
      bpf: introduce function calls (verification) · f4d7e40a
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      To recognize such set of bpf functions the verifier does:
      1. runs control flow analysis to detect function boundaries
      2. proceeds with verification of all functions starting from main(root) function
      It recognizes that the stack of the caller can be accessed by the callee
      (if the caller passed a pointer to its stack to the callee) and the callee
      can store map_value and other pointers into the stack of the caller.
      3. keeps track of the stack_depth of each function to make sure that total
      stack depth is still less than 512 bytes
      4. disallows pointers to the callee stack to be stored into the caller stack,
      since they will be invalid as soon as the callee returns
      5. to reuse all of the existing state_pruning logic each function call
      is considered to be independent call from the verifier point of view.
      The verifier pretends to inline all function calls it sees are being called.
      It stores the callsite instruction index as part of the state to make sure
      that two calls to the same callee from two different places in the caller
      will be different from state pruning point of view
      6. more safety checks are added to liveness analysis
      
      Implementation details:
      . struct bpf_verifier_state is now consists of all stack frames that
        led to this function
      . struct bpf_func_state represent one stack frame. It consists of
        registers in the given frame and its stack
      . propagate_liveness() logic had a premature optimization where
        mark_reg_read() and mark_stack_slot_read() were manually inlined
        with loop iterating over parents for each register or stack slot.
        Undo this optimization to reuse more complex mark_*_read() logic
      . skip_callee() logic is not necessary from safety point of view,
        but without it mark_*_read() markings become too conservative,
        since after returning from the funciton call a read of r6-r9
        will incorrectly propagate the read marks into callee causing
        inefficient pruning later
      . mark_*_read() logic is now aware of control flow which makes it
        more complex. In the future the plan is to rewrite liveness
        to be hierarchical. So that liveness can be done within
        basic block only and control flow will be responsible for
        propagation of liveness information along cfg and between calls.
      . tail_calls and ld_abs insns are not allowed in the programs with
        bpf-to-bpf calls
      . returning stack pointers to the caller or storing them into stack
        frame of the caller is not allowed
      
      Testing:
      . no difference in cilium processed_insn numbers
      . large number of tests follows in next patches
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f4d7e40a
    • A
      bpf: introduce function calls (function boundaries) · cc8b0b92
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      Since the beginning of bpf all bpf programs were represented as a single function
      and program authors were forced to use always_inline for all functions
      in their C code. That was causing llvm to unnecessary inflate the code size
      and forcing developers to move code to header files with little code reuse.
      
      With a bit of additional complexity teach verifier to recognize
      arbitrary function calls from one bpf function to another as long as
      all of functions are presented to the verifier as a single bpf program.
      New program layout:
      r6 = r1    // some code
      ..
      r1 = ..    // arg1
      r2 = ..    // arg2
      call pc+1  // function call pc-relative
      exit
      .. = r1    // access arg1
      .. = r2    // access arg2
      ..
      call pc+20 // second level of function call
      ...
      
      It allows for better optimized code and finally allows to introduce
      the core bpf libraries that can be reused in different projects,
      since programs are no longer limited by single elf file.
      With function calls bpf can be compiled into multiple .o files.
      
      This patch is the first step. It detects programs that contain
      multiple functions and checks that calls between them are valid.
      It splits the sequence of bpf instructions (one program) into a set
      of bpf functions that call each other. Calls to only known
      functions are allowed. In the future the verifier may allow
      calls to unresolved functions and will do dynamic linking.
      This logic supports statically linked bpf functions only.
      
      Such function boundary detection could have been done as part of
      control flow graph building in check_cfg(), but it's cleaner to
      separate function boundary detection vs control flow checks within
      a subprogram (function) into logically indepedent steps.
      Follow up patches may split check_cfg() further, but not check_subprogs().
      
      Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
      These restrictions can be relaxed in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc8b0b92
  6. 17 12月, 2017 1 次提交
  7. 16 12月, 2017 1 次提交
    • D
      bpf: guarantee r1 to be ctx in case of bpf_helper_changes_pkt_data · 04514d13
      Daniel Borkmann 提交于
      Some JITs don't cache skb context on stack in prologue, so when
      LD_ABS/IND is used and helper calls yield bpf_helper_changes_pkt_data()
      as true, then they temporarily save/restore skb pointer. However,
      the assumption that skb always has to be in r1 is a bit of a
      gamble. Right now it turned out to be true for all helpers listed
      in bpf_helper_changes_pkt_data(), but lets enforce that from verifier
      side, so that we make this a guarantee and bail out if the func
      proto is misconfigured in future helpers.
      
      In case of BPF helper calls from cBPF, bpf_helper_changes_pkt_data()
      is completely unrelevant here (since cBPF is context read-only) and
      therefore always false.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      04514d13
  8. 15 12月, 2017 6 次提交
  9. 14 12月, 2017 1 次提交
    • Y
      bpf/tracing: fix kernel/events/core.c compilation error · f4e2298e
      Yonghong Song 提交于
      Commit f371b304 ("bpf/tracing: allow user space to
      query prog array on the same tp") introduced a perf
      ioctl command to query prog array attached to the
      same perf tracepoint. The commit introduced a
      compilation error under certain config conditions, e.g.,
        (1). CONFIG_BPF_SYSCALL is not defined, or
        (2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
             nor CONFIG_KPROBE_EVENTS is defined.
      
      Error message:
        kernel/events/core.o: In function `perf_ioctl':
        core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
      
      This patch fixed this error by guarding the real definition under
      CONFIG_BPF_EVENTS and provided static inline dummy function
      if CONFIG_BPF_EVENTS was not defined.
      It renamed the function from bpf_event_query_prog_array to
      perf_event_query_prog_array and moved the definition from linux/bpf.h
      to linux/trace_events.h so the definition is in proximity to
      other prog_array related functions.
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f4e2298e
  10. 13 12月, 2017 5 次提交
    • E
      bpf: add schedule points to map alloc/free · 9147efcb
      Eric Dumazet 提交于
      While using large percpu maps, htab_map_alloc() can hold
      cpu for hundreds of ms.
      
      This patch adds cond_resched() calls to percpu alloc/free
      call sites, all running in process context.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      9147efcb
    • D
      bpf: fix corruption on concurrent perf_event_output calls · 283ca526
      Daniel Borkmann 提交于
      When tracing and networking programs are both attached in the
      system and both use event-output helpers that eventually call
      into perf_event_output(), then we could end up in a situation
      where the tracing attached program runs in user context while
      a cls_bpf program is triggered on that same CPU out of softirq
      context.
      
      Since both rely on the same per-cpu perf_sample_data, we could
      potentially corrupt it. This can only ever happen in a combination
      of the two types; all tracing programs use a bpf_prog_active
      counter to bail out in case a program is already running on
      that CPU out of a different context. XDP and cls_bpf programs
      by themselves don't have this issue as they run in the same
      context only. Therefore, split both perf_sample_data so they
      cannot be accessed from each other.
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Reported-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      283ca526
    • J
      bpf: add a bpf_override_function helper · 9802d865
      Josef Bacik 提交于
      Error injection is sloppy and very ad-hoc.  BPF could fill this niche
      perfectly with it's kprobe functionality.  We could make sure errors are
      only triggered in specific call chains that we care about with very
      specific situations.  Accomplish this with the bpf_override_funciton
      helper.  This will modify the probe'd callers return value to the
      specified value and set the PC to an override function that simply
      returns, bypassing the originally probed function.  This gives us a nice
      clean way to implement systematic error injection for all of our code
      paths.
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      9802d865
    • J
      add infrastructure for tagging functions as error injectable · 92ace999
      Josef Bacik 提交于
      Using BPF we can override kprob'ed functions and return arbitrary
      values.  Obviously this can be a bit unsafe, so make this feature opt-in
      for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
      order to give BPF access to that function for error injection purposes.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      92ace999
    • Y
      bpf/tracing: allow user space to query prog array on the same tp · f371b304
      Yonghong Song 提交于
      Commit e87c6bc3 ("bpf: permit multiple bpf attachments
      for a single perf event") added support to attach multiple
      bpf programs to a single perf event.
      Although this provides flexibility, users may want to know
      what other bpf programs attached to the same tp interface.
      Besides getting visibility for the underlying bpf system,
      such information may also help consolidate multiple bpf programs,
      understand potential performance issues due to a large array,
      and debug (e.g., one bpf program which overwrites return code
      may impact subsequent program results).
      
      Commit 2541517c ("tracing, perf: Implement BPF programs
      attached to kprobes") utilized the existing perf ioctl
      interface and added the command PERF_EVENT_IOC_SET_BPF
      to attach a bpf program to a tracepoint. This patch adds a new
      ioctl command, given a perf event fd, to query the bpf program
      array attached to the same perf tracepoint event.
      
      The new uapi ioctl command:
        PERF_EVENT_IOC_QUERY_BPF
      
      The new uapi/linux/perf_event.h structure:
        struct perf_event_query_bpf {
             __u32	ids_len;
             __u32	prog_cnt;
             __u32	ids[0];
        };
      
      User space provides buffer "ids" for kernel to copy to.
      When returning from the kernel, the number of available
      programs in the array is set in "prog_cnt".
      
      The usage:
        struct perf_event_query_bpf *query =
          malloc(sizeof(*query) + sizeof(u32) * ids_len);
        query.ids_len = ids_len;
        err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
        if (err == 0) {
          /* query.prog_cnt is the number of available progs,
           * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
           */
        } else if (errno == ENOSPC) {
          /* query.ids_len number of progs copied,
           * query.prog_cnt is the number of available progs
           */
        } else {
            /* other errors */
        }
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f371b304
  11. 12 12月, 2017 2 次提交
    • I
      locking/lockdep: Remove the cross-release locking checks · e966eaee
      Ingo Molnar 提交于
      This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
      while it found a number of old bugs initially, was also causing too many
      false positives that caused people to disable lockdep - which is arguably
      a worse overall outcome.
      
      If we disable cross-release by default but keep the code upstream then
      in practice the most likely outcome is that we'll allow the situation
      to degrade gradually, by allowing entropy to introduce more and more
      false positives, until it overwhelms maintenance capacity.
      
      Another bad side effect was that people were trying to work around
      the false positives by uglifying/complicating unrelated code. There's
      a marked difference between annotating locking operations and
      uglifying good code just due to bad lock debugging code ...
      
      This gradual decrease in quality happened to a number of debugging
      facilities in the kernel, and lockdep is pretty complex already,
      so we cannot risk this outcome.
      
      Either cross-release checking can be done right with no false positives,
      or it should not be included in the upstream kernel.
      
      ( Note that it might make sense to maintain it out of tree and go through
        the false positives every now and then and see whether new bugs were
        introduced. )
      
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e966eaee
    • W
      locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y · d89c7035
      Will Deacon 提交于
      When CONFIG_GENERIC_LOCKBEAK=y, locking structures grow an extra int ->break_lock
      field which is used to implement raw_spin_is_contended() by setting the field
      to 1 when waiting on a lock and clearing it to zero when holding a lock.
      However, there are a few problems with this approach:
      
        - There is a write-write race between a CPU successfully taking the lock
          (and subsequently writing break_lock = 0) and a waiter waiting on
          the lock (and subsequently writing break_lock = 1). This could result
          in a contended lock being reported as uncontended and vice-versa.
      
        - On machines with store buffers, nothing guarantees that the writes
          to break_lock are visible to other CPUs at any particular time.
      
        - READ_ONCE/WRITE_ONCE are not used, so the field is potentially
          susceptible to harmful compiler optimisations,
      
      Consequently, the usefulness of this field is unclear and we'd be better off
      removing it and allowing architectures to implement raw_spin_is_contended() by
      providing a definition of arch_spin_is_contended(), as they can when
      CONFIG_GENERIC_LOCKBREAK=n.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d89c7035