1. 03 Nov, 2015 5 commits
    • bpf: add support for persistent maps/progs · b2197755
      Authored by Daniel Borkmann
      This work adds support for "persistent" eBPF maps/programs. The term
      "persistent" is to be understood that maps/programs have a facility
      that lets them survive process termination. This is desired by various
      eBPF subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how we could tackle this issue and various
      approaches have been tried out so far, which can be read up further in
      the reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a bare minimum and can be further extended later on.
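
      As a rough user-space illustration of the pin/get flow described above,
      the sketch below wraps bpf(2) for BPF_OBJ_PIN and BPF_OBJ_GET. It assumes
      a <linux/bpf.h> that already defines both commands and the pathname/bpf_fd
      attribute fields. The names sys_bpf(), obj_pin() and obj_get() are
      hypothetical helpers for this example, not part of the patch.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper that issues bpf(2) through syscall(2). */
int sys_bpf(int cmd, union bpf_attr *attr)
{
	return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin an existing map/prog file descriptor at 'path', which is expected
 * to live below a mounted bpf file system (default: /sys/fs/bpf/).
 * Returns 0 on success, -1 with errno set on failure.
 */
int obj_pin(int fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)fd;
	return sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Fetch a new file descriptor for the object pinned at 'path'.
 * Only the pathname is passed. Returns the fd, or -1 with errno set.
 */
int obj_get(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return sys_bpf(BPF_OBJ_GET, &attr);
}
```

      A first process could call obj_pin(map_fd, "/sys/fs/bpf/my_map") (the
      path is only an example) and exit; a later process would call obj_get()
      with the same path to obtain its own descriptor. Removing the node is
      plain unlink(2).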
      
      The next step I'm working on is to add dump eBPF map/prog commands
      to bpf(2), so that a specification from a given file descriptor can
      be retrieved. This can be used by things like CRIU but also applications
      can inspect the meta data after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes who significantly contributed
      in the design discussion that eventually let us end up with this
      architecture here.
      
      Reference: https://lkml.org/lkml/2015/10/15/925

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: consolidate bpf_prog_put{, _rcu} dismantle paths · e9d8afa9
      Authored by Daniel Borkmann
      We currently have duplicated cleanup code in bpf_prog_put() and
      bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
      not worth it to make it a common helper called by both, but with
      the recent addition of resource charging, we could have avoided
      the fix in commit ac00737f ("bpf: Need to call bpf_prog_uncharge_memlock
      from bpf_prog_put") if we would have had only a single, common path.
      We can simplify it further by assigning aux->prog only once during
      allocation time.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: align and clean bpf_{map,prog}_get helpers · c2101297
      Authored by Daniel Borkmann
      Add a bpf_map_get() function that we're going to use later on and
      align/clean the remaining helpers a bit so that we have them a bit
      more consistent:
      
        - __bpf_map_get() and __bpf_prog_get() that both work on the fd
          struct, check whether the descriptor is eBPF and return the
          pointer to the map/prog stored in the private data.
      
          Also, we can return f.file->private_data directly, the function
          signature is enough of a documentation already.
      
        - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
          call their respective __bpf_map_get()/__bpf_prog_get() variants,
          and take a reference.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: abstract anon_inode_getfd invocations · aa79781b
      Authored by Daniel Borkmann
      Since we're going to use anon_inode_getfd() invocations in more than just
      the current places, make a helper function for both, so that we only need
      to pass a map/prog pointer to the helper itself in order to get a fd. The
      new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: convert hashtab lock to raw lock · ac00881f
      Authored by Yang Shi
      When running bpf samples on rt kernel, it reports the below warning:
      
      BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
      in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
      Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      
      CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
      [<ffff80000008a6f8>] show_stack+0x20/0x30
      [<ffff8000007da90c>] dump_stack+0x7c/0xa0
      [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
      [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
      [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
      [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
      [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
      [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
      [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
      [<ffff8000007e399c>] kprobe_handler+0x114/0x250
      [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
      [<ffff800000085b80>] brk_handler+0x88/0x98
      [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
      Exception stack(0xffff808349687460 to 0xffff808349687580)
      7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
      7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
      74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
      74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
      74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
      7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
      7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
      7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
      7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77
      
      Convert hashtab lock to raw lock to avoid such warning.
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Oct, 2015 1 commit
  3. 22 Oct, 2015 1 commit
    • bpf: introduce bpf_perf_event_output() helper · a43eec30
      Authored by Alexei Starovoitov
      This helper is used to send raw data from eBPF program into
      special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
      User space needs to perf_event_open() it (either for one or all cpus) and
      store FD into perf_event_array (similar to bpf_perf_event_read() helper)
      before eBPF program can send data into it.
      
      Today the programs triggered by kprobe collect the data and either store
      it into the maps or print it via bpf_trace_printk(), where the latter is a
      debug facility and not suitable for streaming the data. This new helper
      replaces such bpf_trace_printk() usage and allows programs to have a dedicated
      channel into user space for post-processing of the raw data collected.
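
      The user-space half of the setup described above might look roughly like
      the following sketch. It assumes a <linux/perf_event.h> that defines
      PERF_COUNT_SW_BPF_OUTPUT. The helper names are hypothetical, and the use
      of PERF_SAMPLE_RAW as the sample type is an assumption that this commit
      text does not state.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill 'attr' for the software event that receives bpf_perf_event_output() data. */
void init_bpf_output_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_BPF_OUTPUT;
	attr->sample_type = PERF_SAMPLE_RAW; /* assumption: raw samples carry the data */
}

/*
 * Open the event for all tasks (pid = -1) on one cpu.
 * Returns a perf event fd, or -1 with errno set.
 */
int open_bpf_output_event(int cpu)
{
	struct perf_event_attr attr;

	init_bpf_output_attr(&attr);
	return (int)syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0UL);
}
```

      The returned descriptor would then be stored into the perf_event_array
      map (for example at the index of the cpu it was opened on) before the
      eBPF program starts sending data; that map update is omitted here.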
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 16 Oct, 2015 1 commit
  5. 13 Oct, 2015 2 commits
    • bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Authored by Alexei Starovoitov
      Since eBPF programs and maps use kernel memory, consider it 'locked' memory
      from the user accounting point of view and charge it against the
      RLIMIT_MEMLOCK limit. This limit is typically set to 64 Kbytes by distros,
      so almost all bpf+tracing programs will need to increase it: they use maps,
      and the kernel charges the maximum map size upfront.
      For example, a hash map of 1024 elements will be charged as 64 Kbytes.
      This is inconvenient for current users and changes current behavior for root,
      but is probably worth doing to keep root and non-root consistent.
      
      Similar accounting logic is done by mmap of perf_event.
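
      For programs that now hit the limit, a minimal sketch of raising
      RLIMIT_MEMLOCK before creating maps could look as follows. The helper
      name ensure_memlock_limit() is hypothetical, and how much headroom a
      given program needs is left to the caller.

```c
#include <sys/resource.h>

/*
 * Make sure the soft RLIMIT_MEMLOCK is at least 'wanted' bytes.
 * Raising the hard limit as well requires CAP_SYS_RESOURCE, so this
 * may fail for unprivileged callers.
 * Returns 0 on success, -1 with errno set on failure.
 */
int ensure_memlock_limit(rlim_t wanted)
{
	struct rlimit r;

	if (getrlimit(RLIMIT_MEMLOCK, &r) != 0)
		return -1;

	/* Already unlimited or large enough: nothing to do. */
	if (r.rlim_cur == RLIM_INFINITY || r.rlim_cur >= wanted)
		return 0;

	r.rlim_cur = wanted;
	if (r.rlim_max != RLIM_INFINITY && r.rlim_max < wanted)
		r.rlim_max = wanted;

	return setrlimit(RLIMIT_MEMLOCK, &r);
}
```

      A tracing tool would call this once at startup, before its first map
      creation, and report the error to the user if it fails.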
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable non-root eBPF programs · 1be7f75d
      Authored by Alexei Starovoitov
      In order to let unprivileged users load and execute eBPF programs,
      teach the verifier to prevent pointer leaks.
      Verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged process,
      and the toggle cannot be set back to false.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 11 Oct, 2015 1 commit
    • bpf: fix cb access in socket filter programs · ff936a04
      Authored by Alexei Starovoitov
      eBPF socket filter programs may see junk in 'u32 cb[5]' area,
      since it could have been used by protocol layers earlier.
      
      For socket filter programs used in af_packet we need to clean
      20 bytes of skb->cb area if it could be used by the program.
      For programs attached to TCP/UDP sockets we need to save/restore
      these 20 bytes, since it's used by protocol layers.
      
      Remove SK_RUN_FILTER macro, since it's no longer used.
      
      Long term we may move this bpf cb area to per-cpu scratch, but that
      requires addition of new 'per-cpu load/store' instructions,
      so not suitable as a short term fix.
      
      Fixes: d691f9e8 ("bpf: allow programs to write to certain skb fields")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff936a04
  7. 08 10月, 2015 1 次提交
    • D
      bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs · 3ad00405
      Daniel Borkmann 提交于
      While recently arguing on a seccomp discussion that raw prandom_u32()
      access shouldn't be exposed to unprivileged user space, I forgot the
      fact that the SKF_AD_RANDOM extension has actually been doing so for some
      time in cBPF via commit 4cd3675e ("filter: added BPF random opcode").
      
      Since prandom_u32() is being used in a lot of critical networking code,
      let's be more conservative and split their states. Furthermore, consolidate
      eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
      bpf_get_prandom_u32() was only accessible for privileged users, but
      should that change one day, we also don't want to leak raw sequences
      through things like eBPF maps.
      
      One thought was also to have own per bpf_prog states, but due to ABI
      reasons this is not easily possible, i.e. the program code currently
      cannot access bpf_prog itself, and copying the rnd_state to/from the
      stack scratch space whenever a program uses the prng seems not really
      worth the trouble and seems too hacky. If needed, taus113 could in such
      cases be implemented within eBPF using a map entry to keep the state
      space, or get_random_bytes() could become a second helper in cases where
      performance would not be critical.
      
      Both sides can trigger a one-time late init via prandom_init_once() on
      the shared state. Performance-wise, there should even be a tiny gain
      as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
      inside the BPF core since kernels could have a NET-less config as well.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Chema Gonzalez <chema@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 05 Oct, 2015 1 commit
  9. 03 Oct, 2015 2 commits
  10. 10 Sep, 2015 2 commits
  11. 13 Aug, 2015 1 commit
  12. 10 Aug, 2015 3 commits
  13. 27 Jul, 2015 1 commit
  14. 21 Jul, 2015 1 commit
    • test_bpf: add bpf_skb_vlan_push/pop() tests · 4d9c5c53
      Authored by Alexei Starovoitov
      improve accuracy of timing in test_bpf and add two stress tests:
      - {skb->data[0], get_smp_processor_id} repeated 2k times
      - {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68
      
      1st test is useful to test performance of JIT implementation of BPF_LD_ABS
      together with BPF_CALL instructions.
      2nd test is stressing skb_vlan_push/pop logic together with skb->data access
      via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly.
      
      In order to call bpf_skb_vlan_push() from test_bpf.ko, we have to add
      three EXPORT_SYMBOL_GPL() exports.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 14 Jul, 2015 1 commit
  16. 16 Jun, 2015 2 commits
  17. 07 Jun, 2015 1 commit
    • bpf: allow programs to write to certain skb fields · d691f9e8
      Authored by Alexei Starovoitov
      Allow programs to read/write the skb->mark and tc_index fields and
      ((struct qdisc_skb_cb *)cb)->data.
      
      mark and tc_index are generically useful in TC.
      cb[0]-cb[4] are primarily used to pass arguments from one
      program to another called via bpf_tail_call() which can
      be seen in sockex3_kern.c example.
      
      All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
      mark, tc_index are writeable from tc_cls_act only.
      cb[0]-cb[4] are writeable by both sockets and tc_cls_act.
      
      Add verifier tests and improve sample code.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 01 Jun, 2015 2 commits
  19. 31 May, 2015 1 commit
  20. 22 May, 2015 1 commit
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Authored by Alexei Starovoitov
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
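
      The fall-through and loop-limit semantics described above can be modelled
      in plain user-space C. The sketch below is only a model: it uses ordinary
      function calls rather than reusing the stack frame, every model_* name is
      hypothetical, and the exact boundary at which the kernel stops (the text
      only says the limit is set to 32) is an assumption of the model.

```c
#include <stddef.h>

#define MODEL_MAX_TAIL_CALLS 32
#define MODEL_TABLE_SIZE 4

struct model_ctx {
	unsigned int tail_call_cnt; /* tail calls taken so far */
	unsigned int progs_run;     /* programs entered, for inspection */
};

typedef int (*model_prog)(struct model_ctx *ctx);

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY map; empty slots are NULL. */
model_prog model_jmp_table[MODEL_TABLE_SIZE];

/*
 * Returns 1 and stores the callee's result in *ret if the jump was taken.
 * Returns 0 (caller continues as normal) if the slot is empty, the index
 * is out of range, or the tail call limit has been reached.
 */
int model_tail_call(struct model_ctx *ctx, size_t index, int *ret)
{
	if (index >= MODEL_TABLE_SIZE || !model_jmp_table[index])
		return 0;
	if (ctx->tail_call_cnt >= MODEL_MAX_TAIL_CALLS)
		return 0;
	ctx->tail_call_cnt++;
	*ret = model_jmp_table[index](ctx);
	return 1;
}

/* A program that always tries to tail-call slot 0, i.e. itself. */
int model_loop_prog(struct model_ctx *ctx)
{
	int ret;

	ctx->progs_run++;
	if (model_tail_call(ctx, 0, &ret))
		return ret;
	return 0; /* fell through: limit reached or slot empty */
}

/* Runs the self-referencing chain and reports how many programs were entered. */
unsigned int model_run_self_loop(void)
{
	struct model_ctx ctx = {0, 0};

	model_jmp_table[0] = model_loop_prog;
	model_loop_prog(&ctx);
	return ctx.progs_run;
}
```

      In this model a program that keeps jumping to itself is entered 33 times
      (the initial entry plus 32 tail calls) and then simply continues, which
      is how the counter turns an unbounded dynamic loop into a bounded one.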
      
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>

      Use cases:
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that prog_type of callee and caller
        are the same and JITed/non-JITed flag is the same, since calling JITed
        program from non-JITed is invalid, since stack frames are different.
        Similarly calling kprobe type program from socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 28 Apr, 2015 1 commit
  22. 17 Apr, 2015 2 commits
    • bpf: fix two bugs in verification logic when accessing 'ctx' pointer · 725f9dcd
      Authored by Alexei Starovoitov
      1.
      first bug is a silly mistake. It broke tracing examples and prevented
      simple bpf programs from loading.
      
      In the following code:
      if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
      } else if (...) {
        // this part should have been executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      Obviously it's not doing that. So simple instructions like:
      r2 = *(u64 *)(r1 + 8)
      will be rejected. Note the comments in the code around these branches
      were and still are valid and indicate the true intent.
      
      Replace it with:
      if (BPF_SIZE(insn->code) != BPF_W)
        continue;
      
      if (insn->imm == 0) {
      } else if (...) {
        // now this code will be executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      2.
      second bug is more subtle.
      If malicious code is using the same dest register as source register,
      the checks designed to prevent the same instruction to be used with different
      pointer types will fail to trigger, since we were assigning src_reg_type
      when it was already overwritten by check_mem_access().
      The fix is trivial. Just move line:
      src_reg_type = regs[insn->src_reg].type;
      before check_mem_access().
      Add new 'access skb fields bad4' test to check this case.
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix verifier memory corruption · c3de6317
      Authored by Alexei Starovoitov
      Due to missing bounds check the DAG pass of the BPF verifier can corrupt
      the memory which can cause random crashes during program loading:
      
      [8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
      [8.451293] IP: [<ffffffff811de33d>] kmem_cache_alloc_trace+0x8d/0x2f0
      [8.452329] Oops: 0000 [#1] SMP
      [8.452329] Call Trace:
      [8.452329]  [<ffffffff8116cc82>] bpf_check+0x852/0x2000
      [8.452329]  [<ffffffff8116b7e4>] bpf_prog_load+0x1e4/0x310
      [8.452329]  [<ffffffff811b190f>] ? might_fault+0x5f/0xb0
      [8.452329]  [<ffffffff8116c206>] SyS_bpf+0x806/0xa30
      
      Fixes: f1bca824 ("bpf: add search pruning optimization to verifier")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 02 Apr, 2015 1 commit
    • tracing, perf: Implement BPF programs attached to kprobes · 2541517c
      Authored by Alexei Starovoitov
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
      'prog_fd' is a file descriptor associated with BPF program
      previously loaded.
      
      'event_id' is an ID of the kprobe created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
        - probe_read - wrapper of probe_kernel_read() used to access any
          kernel data structures
      
      BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
      architecture dependent) and return 0 to ignore the event and 1 to store
      kprobe event into the ring buffer.
      
      Note, kprobes are fundamentally _not_ a stable kernel ABI,
      so BPF programs attached to kprobes must be recompiled for
      every kernel version and user must supply correct LINUX_VERSION_CODE
      in attr.kern_version during bpf_prog_load() call.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 30 Mar, 2015 1 commit
  25. 21 Mar, 2015 1 commit
    • ebpf: add sched_act_type and map it to sk_filter's verifier ops · 94caee8c
      Authored by Daniel Borkmann
      In order to prepare eBPF support for tc action, we need to add
      sched_act_type, so that the eBPF verifier is aware of what helper
      function act_bpf may use, that it can load skb data and read out
      currently available skb fields.
      
      This is basically analogous to 96be4325 ("ebpf: add sched_cls_type
      and map it to sk_filter's verifier ops").
      
      BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
      separate since both will have a different set of functionality in
      future (classifier vs action), thus we won't run into ABI troubles
      when the point in time comes to diverge functionality from the
      classifier.
      
      The future plan for act_bpf would be that it will be able to write
      into skb->data and alter selected fields mirrored in struct __sk_buff.
      
      For an initial support, it's sufficient to map it to sk_filter_ops.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 16 Mar, 2015 3 commits