1. 02 Jul 2016, 1 commit
    • bpf: generally move prog destruction to RCU deferral · 1aacde3d
      By Daniel Borkmann
      Jann Horn reported the following analysis, which could potentially
      result in a very hard to trigger (if not impossible) UAF race; to
      quote his event timeline:
      
       - Set up a process with threads T1, T2 and T3
       - Let T1 set up a socket filter F1 that invokes another filter F2
         through a BPF map [tail call]
       - Let T1 trigger the socket filter via a unix domain socket write,
         don't wait for completion
       - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
       - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
       - Let T3 close the file descriptor for F2, dropping the reference
         count of F2 to 2
       - At this point, T1 should have looked up F2 from the map, but not
         finished executing it
       - Let T3 remove F2 from the BPF map, dropping the reference count of
         F2 to 1
       - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
         the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
         via schedule_work()
       - At this point, the BPF program could be freed
       - BPF execution is still running in a freed BPF program
      
      While at PERF_EVENT_IOC_SET_BPF time it is only guaranteed that the perf
      event fd we are doing the syscall on does not disappear from underneath
      us for the whole syscall duration, the same may not hold for the bpf fd
      used as an argument once we have done the put. It needs to be a valid fd
      pointing to a BPF program at the time of the call for bpf_prog_get() to
      succeed, and while T2 is preempted, F2 must have had its reference count
      dropped to 1 on the other CPU. The fput() from the close() in T3 should
      also add additional delay to the reference drop via exit_task_work()
      before bpf_prog_release() gets called, as well as to the scheduling of
      bpf_prog_free_deferred().
      
      That said, it nevertheless makes sense to generally move BPF prog
      destruction after an RCU grace period, to guarantee that the scenario
      above, but also others such as the one recently fixed in ceb56070
      ("bpf, perf: delay release of BPF prog after grace period") with regard
      to tail calls, cannot happen. Integrating bpf_prog_free_deferred()
      directly into the RCU callback is not allowed, since the invocation
      might happen from either softirq or process context, so we are not
      permitted to block there. All bpf_prog_put() invocations from the eBPF
      side (note, cBPF -> eBPF progs do not use this for their destruction)
      have been reviewed and look fine with call_rcu().
      
      Since we do not know whether, at the time of attaching the program, we
      are already part of a tail call map, we need to use the RCU variant
      unconditionally. This will not, however, put significantly more stress
      on the RCU callback queue: situations with the above bpf_prog_get() /
      bpf_prog_put() combo in practice normally do not lead to releases, and
      even if they did, enough effort/cycles have to be put into loading a
      BPF program into the kernel in the first place.
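
      As a rough illustration of the resulting pattern (a minimal sketch, not
      the exact patch; function and field names are assumptions), the put path
      defers destruction through an RCU callback, which itself must not block
      and therefore only triggers the deferred, potentially blocking cleanup:

        /* Sketch only; names are illustrative. */
        static void __bpf_prog_put_rcu(struct rcu_head *rcu)
        {
            struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);

            /* May run from softirq context, so no blocking here; the actual
             * vfree() happens later from a workqueue.
             */
            bpf_prog_free(aux->prog);
        }

        void bpf_prog_put(struct bpf_prog *prog)
        {
            if (atomic_dec_and_test(&prog->aux->refcnt))
                call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
        }
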
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Jun 2016, 2 commits
    • bpf, maps: flush own entries on perf map release · 3b1efb19
      By Daniel Borkmann
      The behavior of perf event arrays is quite different from all others,
      as they are tightly coupled to perf event fds, as shown f.e. recently
      by commit e03e7ee3 ("perf/bpf: Convert perf_event_array to use struct
      file") to make refcounting on perf events more robust. A remaining
      issue in the current code is that additions to the perf event array
      take a reference on the struct file via perf_event_get(), and that
      reference is only released via fput() (which eventually cleans up the
      perf event via perf_event_release_kernel()) when the element is either
      manually removed from the map from user space or automatically when
      the last reference on the perf event map is dropped. However, this
      leads to dangling struct files when the map stays pinned after the
      application owning the perf event descriptor exits; since the struct
      file reference will in such a case only be dropped manually or via
      pinned file removal, the perf event lives longer than necessary,
      needlessly consuming resources for that time.
      
      Relations between perf event fds and bpf perf event map fds can be
      rather complex. F.e. maps can act as demuxers among different perf
      event fds that may be owned by different threads: based on the index
      selected by the program, events get dispatched to one of the per-cpu
      fd endpoints. One perf event fd (or rather, a per-cpu set of them) can
      also live in multiple perf event maps at the same time, listening for
      events. Another requirement is that perf event fds can be closed from
      the application side after they have been attached to the perf event
      map, so that on exit the perf event map will eventually take care of
      dropping their references. Likewise, when such maps are pinned, the
      intended behavior is that a user application does bpf_obj_get(), puts
      its fds in there, and when its fd is released on exit, they are dropped
      from the map again; the map thus acts more as a connector endpoint.
      This also makes perf event maps inherently different from program
      arrays, as described in more detail in commit c9da161c ("bpf: fix
      clearing on persistent program array maps").
      
      To tackle this, map entries are marked with the map struct file that
      added the element to the map, and when the last reference to that map
      struct file is released from user space, the tracked entries are purged
      from the map. This is okay because new map struct file instances, i.e.
      frontends to the anon inode, are provided via bpf_map_new_fd(), which
      is called both when we invoke bpf_obj_get_user() to retrieve a pinned
      map and when an initial instance is created via map_create(). The rest
      is resolved automatically for us by the vfs layer keeping a reference
      count on the map's struct file. Any concurrent updates on a map slot
      are fine as well; it just means that perf_event_fd_array_release()
      has fewer of its own entries to delete.
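
      To illustrate the idea (a minimal sketch; the struct layout and helper
      names are assumptions rather than the exact patch), each slot remembers
      the map struct file that inserted it, and the release path only purges
      its own entries:

        /* Sketch only; names are illustrative. */
        struct bpf_event_entry {
            struct perf_event *event;
            struct file *perf_file;   /* reference on the perf event fd */
            struct file *map_file;    /* map struct file that added the entry */
            struct rcu_head rcu;
        };

        static void perf_event_fd_array_release(struct bpf_map *map,
                                                struct file *map_file)
        {
            struct bpf_array *array = container_of(map, struct bpf_array, map);
            struct bpf_event_entry *ee;
            int i;

            rcu_read_lock();
            for (i = 0; i < array->map.max_entries; i++) {
                ee = READ_ONCE(array->ptrs[i]);
                if (ee && ee->map_file == map_file)
                    fd_array_map_delete_elem(map, &i);
            }
            rcu_read_unlock();
        }
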
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, maps: extend map_fd_get_ptr arguments · d056a788
      By Daniel Borkmann
      This patch extends the map_fd_get_ptr() callback used by fd array maps,
      so that the struct file pointer of the related map can be passed in.
      It is safe to remove the map_update_elem() callback for the two maps,
      since updates are only allowed from the syscall side, but not from eBPF
      programs, for these two map types. As in the per-cpu map case,
      bpf_fd_array_map_update_elem() needs to be called directly here due to
      the extra argument.
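
      A minimal sketch of the extended callback shape as described (treat the
      exact prototypes as assumptions):

        /* Sketch: fd array maps get to see which map struct file did the update. */
        struct bpf_map_ops {
            /* ... other callbacks ... */
            void *(*map_fd_get_ptr)(struct bpf_map *map,
                                    struct file *map_file, int fd);
        };

        /* Called directly from the syscall layer, since the generic
         * map_update_elem() callback has no room for the map_file argument.
         */
        int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
                                         void *key, void *value, u64 map_flags);
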
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Mar 2016, 1 commit
  4. 06 Feb 2016, 2 commits
    • bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      By Alexei Starovoitov
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all CPUs for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
        int i;
      
        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      User space must provide a round_up(value_size, 8) * nr_cpus sized
      buffer to get/set the values, since the kernel uses 'long' copies of
      the per-cpu values to try to copy good counters atomically. This is
      best-effort, since bpf programs and user space are racing to access
      the same memory.
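
      For a value_size that is not already a multiple of 8, the per-CPU stride
      matters; a minimal user-space sketch (reusing the bpf_lookup_elem()
      wrapper from the example above, everything else is illustrative):

        #include <stdint.h>
        #include <stdlib.h>
        #include <unistd.h>

        #define ROUND_UP8(x) (((x) + 7) & ~7UL)

        /* Sketch: sum one u64 counter across all possible CPUs. */
        static uint64_t sum_percpu(int map_fd, void *key, size_t value_size)
        {
            unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
            size_t stride = ROUND_UP8(value_size);
            unsigned char *buf = calloc(nr_cpus, stride);
            uint64_t sum = 0;
            unsigned int i;

            if (buf && bpf_lookup_elem(map_fd, key, buf) == 0)
                for (i = 0; i < nr_cpus; i++)
                    sum += *(uint64_t *)(buf + i * stride);
            free(buf);
            return sum;
        }
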
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8
      By Alexei Starovoitov
      The primary use case is a latency histogram array, where a bpf program
      computes the latency of block requests or other events and stores the
      histogram into an array of 64 elements. All CPUs are constantly
      running, so a normal increment is not accurate and bpf_xadd causes
      cache ping-pong; this per-cpu approach allows the fastest
      collision-free counters.
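
      A minimal sketch of such a histogram bump from the BPF program side
      (map definition macro as in samples/bpf; the kprobe target and the two
      helpers marked hypothetical are assumptions):

        struct bpf_map_def SEC("maps") lat_hist = {
            .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
            .key_size    = sizeof(u32),
            .value_size  = sizeof(u64),
            .max_entries = 64,
        };

        SEC("kprobe/blk_account_io_done")
        int trace_completion(struct pt_regs *ctx)
        {
            u64 delta_ns = compute_latency(ctx);   /* hypothetical helper */
            u32 slot = log2_bucket(delta_ns);      /* hypothetical: clamps to 0..63 */
            u64 *cnt = bpf_map_lookup_elem(&lat_hist, &slot);

            if (cnt)
                (*cnt)++;       /* touches only this CPU's copy, no atomics needed */
            return 0;
        }
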
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 29 Jan 2016, 1 commit
  6. 03 Dec 2015, 1 commit
    • bpf: fix allocation warnings in bpf maps and integer overflow · 01b3f521
      By Alexei Starovoitov
      For large map->value_size, user space can trigger memory allocation warnings like:
      WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
      __alloc_pages_nodemask+0x695/0x14e0()
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
       [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
       [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
       [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
       [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
       [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
       [<     inline     >] alloc_pages include/linux/gfp.h:451
       [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
       [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
       [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
       [<     inline     >] kmalloc_large include/linux/slab.h:390
       [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
       [<     inline     >] kmalloc include/linux/slab.h:463
       [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
       [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744
      
      To avoid a kmalloc that can never succeed with order >= MAX_ORDER, check
      that elem->value_size and the computed elem_size are within limits for
      both hash and array type maps. Also add __GFP_NOWARN to the kmalloc of
      value_size/elem_size to avoid OOM warnings. Note that kmalloc(key_size)
      is highly unlikely to trigger OOM, since key_size <= 512, so those
      kmalloc-s are kept as-is.
      
      Large value_size can cause integer overflows in elem_size and map.pages
      formulas, so check for that as well.
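
      A rough sketch of the kind of checks described, on the array map side
      (the exact bounds and constants here are illustrative assumptions, not
      the precise patch):

        /* Sketch: reject sizes kmalloc can never satisfy, and compute the
         * total size in 64 bit so a large value_size cannot overflow it.
         */
        if (attr->value_size > KMALLOC_MAX_SIZE)
            /* if value_size is bigger, user space won't be able to access
             * the elements via the syscall anyway
             */
            return ERR_PTR(-E2BIG);

        elem_size = round_up(attr->value_size, 8);

        array_size = sizeof(struct bpf_array) + (u64)attr->max_entries * elem_size;
        if (array_size >= U32_MAX - PAGE_SIZE)
            return ERR_PTR(-ENOMEM);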
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 02 Dec 2015, 1 commit
    • bpf, array: fix heap out-of-bounds access when updating elements · fbca9d2d
      By Daniel Borkmann
      During my own review, but also reported by Dmitry's syzkaller [1], it
      has been noticed that we trigger a heap out-of-bounds access on eBPF
      array maps when updating elements. This happens with every map whose
      map->value_size (specified at map creation time) is not a multiple of
      8 bytes.
      
      In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
      used to align array map slots for faster access. However, in
      array_map_update_elem(), we update the element as ...

      memcpy(array->value + array->elem_size * index, value, array->elem_size);

      ... where we access 'value' out-of-bounds, since it was allocated from
      map_update_elem() on the syscall side as kmalloc(map->value_size, GFP_USER)
      and later on copied through copy_from_user(value, uvalue, map->value_size).
      Thus, we can access up to 7 bytes out-of-bounds.
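
      The direction of the fix, shown here only as a sketch of the idea, is to
      copy no more than the user-visible value_size into the padded slot:

        /* The slot is elem_size (rounded up to 8) wide, but 'value' holds only
         * map->value_size bytes, so bound the copy by the latter.
         */
        memcpy(array->value + array->elem_size * index,
               value, map->value_size);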
      
      The same could happen from within an eBPF program, where in the worst
      case we access beyond the eBPF program's designated stack.

      Since 1be7f75d ("bpf: enable non-root eBPF programs") didn't hit an
      official release yet, it only affects privileged users.
      
      In case of array_map_lookup_elem(), the verifier prevents eBPF programs
      from accessing beyond map->value_size through check_map_access(). Also,
      from the syscall side, map_lookup_elem() only copies map->value_size
      back to user space, so nothing could leak.
      
        [1] http://github.com/google/syzkaller
      
      Fixes: 28fbcfa0 ("bpf: add array type of eBPF maps")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 27 Oct 2015, 1 commit
  9. 22 Oct 2015, 1 commit
    • bpf: introduce bpf_perf_event_output() helper · a43eec30
      By Alexei Starovoitov
      This helper is used to send raw data from an eBPF program into a
      special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
      User space needs to perf_event_open() it (either for one or all CPUs)
      and store the FD into a perf_event_array (similar to the
      bpf_perf_event_read() helper) before the eBPF program can send data
      into it.
      
      Today, programs triggered by kprobes collect data and either store it
      into maps or print it via bpf_trace_printk(), where the latter is a
      debug facility and not suitable for streaming data. This new helper
      replaces such bpf_trace_printk() usage and gives programs a dedicated
      channel into user space for post-processing of the raw data collected.
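
      A minimal sketch of the program-side usage (record layout, section name
      and the per-cpu index choice are assumptions; the perf event array map
      definition is omitted):

        struct event {
            u64 ts_ns;
            u32 pid;
        };

        SEC("kprobe/sys_write")
        int probe_write(struct pt_regs *ctx)
        {
            struct event e = {
                .ts_ns = bpf_ktime_get_ns(),
                .pid   = (u32)bpf_get_current_pid_tgid(),
            };
            u32 cpu = bpf_get_smp_processor_id();

            /* my_perf_map: a BPF_MAP_TYPE_PERF_EVENT_ARRAY populated by user
             * space with perf_event_open() fds, one per CPU.
             */
            bpf_perf_event_output(ctx, &my_perf_map, cpu, &e, sizeof(e));
            return 0;
        }
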
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 13 Oct 2015, 1 commit
    • bpf: charge user for creation of BPF maps and programs · aaac3ba9
      By Alexei Starovoitov
      Since eBPF programs and maps use kernel memory, consider it 'locked'
      memory from the user accounting point of view and charge it against the
      RLIMIT_MEMLOCK limit. This limit is typically set to 64 KB by distros,
      so almost all bpf+tracing programs would need to increase it, since
      they use maps and the kernel charges the maximum map size upfront.
      For example, a hash map of 1024 elements will be charged as 64 KB.
      It's inconvenient for current users and changes current behavior for
      root, but is probably worth doing to be consistent between root and
      non-root.

      Similar accounting logic is done by the mmap of a perf_event.
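
      In practice this means bpf users typically raise the limit before
      loading anything; a minimal user-space sketch:

        #include <sys/resource.h>

        /* Sketch: lift RLIMIT_MEMLOCK so map/program creation is not rejected
         * once the 64K default is exhausted.
         */
        struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };
        setrlimit(RLIMIT_MEMLOCK, &r);
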
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 05 Oct 2015, 1 commit
  12. 10 Aug 2015, 2 commits
  13. 31 May 2015, 1 commit
  14. 22 May 2015, 1 commit
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      By Alexei Starovoitov
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      A chain of tail calls can form unpredictable dynamic loops, therefore
      tail_call_cnt is used to limit the number of calls; it is currently set to 32.
      
      Use cases:
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
       - for the TC use case, bpf_tail_call() makes it possible to implement
         reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
       - high performance of bpf_tail_call() is the goal.
         It could have been implemented without JIT changes as a wrapper on top of
         the BPF_PROG_RUN() macro, but with two downsides:
         . all programs would have to pay a performance penalty for this feature,
           and the tail call itself would be slower, since a mandatory stack unwind,
           return and stack allocation would be done for every tail call.
         . tail calls would be limited to programs running with preemption disabled,
           since the generic 'void *ctx' doesn't have room for 'tail_call_cnt', which
           would need to be either a global per-cpu variable accessed by the helper
           and the wrapper, or a global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
       - bpf_prog_array_compatible() ensures that the prog_type of callee and
         caller are the same and that the JITed/non-JITed flag matches, since
         calling a JITed program from a non-JITed one is invalid because the
         stack frames are different. Similarly, calling a kprobe type program
         from a socket type program is invalid. See the sketch after this list.

       - the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse the
         'map' abstraction, its user space API and all of the verifier logic.
         It lives in the existing arraymap.c file, since several functions are
         shared with the regular array map.
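
      A minimal sketch of that compatibility check (the owner_* field names
      are assumptions):

        /* Sketch: the prog array remembers the flavor of the first program
         * stored in it and only accepts matching programs afterwards.
         */
        static bool bpf_prog_array_compatible(struct bpf_array *array,
                                              const struct bpf_prog *fp)
        {
            if (!array->owner_prog_type) {
                /* the very first program decides type and JITed-ness */
                array->owner_prog_type = fp->type;
                array->owner_jited = fp->jited;
                return true;
            }
            return array->owner_prog_type == fp->type &&
                   array->owner_jited == fp->jited;
        }
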
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 02 Mar 2015, 1 commit
  16. 20 Nov 2014, 1 commit
  17. 19 Nov 2014, 1 commit
    • bpf: add array type of eBPF maps · 28fbcfa0
      By Alexei Starovoitov
      Add the new map type BPF_MAP_TYPE_ARRAY and its implementation:
      
       - optimized for fastest possible lookup()
         . in the future the verifier/JIT may recognize lookup() with a constant
           key and optimize it into a constant pointer. A non-constant key can be
           optimized into direct pointer arithmetic as well, since pointers and
           value_size are constant for the life of the eBPF program.
           In other words, array_map_lookup_elem() may be 'inlined' by the
           verifier/JIT while preserving concurrent access to this map from
           user space
      
      - two main use cases for array type:
        . 'global' eBPF variables: array of 1 element with key=0 and value is a
          collection of 'global' variables which programs can use to keep the state
          between events
        . aggregation of tracing events into fixed set of buckets
      
      - all array elements pre-allocated and zero initialized at init time
      
       - the key is an index into the array and can only be 4 bytes
      
      - map_delete_elem() returns EINVAL, since elements cannot be deleted
      
       - map_update_elem() replaces elements in a non-atomic way
         (for atomic updates, the hashtable type should be used instead);
         see the usage sketch below
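
      A minimal usage sketch from the BPF program side (map definition macro
      as in samples/bpf; section and map names are illustrative), as referenced
      above:

        /* Sketch: a one-element array used as a 'global' counter between events. */
        struct bpf_map_def SEC("maps") global_cnt = {
            .type        = BPF_MAP_TYPE_ARRAY,
            .key_size    = sizeof(u32),     /* key is always a 4-byte index */
            .value_size  = sizeof(long),
            .max_entries = 1,
        };

        SEC("kprobe/sys_sync")
        int count_sync(struct pt_regs *ctx)
        {
            u32 key = 0;
            long *val = bpf_map_lookup_elem(&global_cnt, &key);

            if (val)
                __sync_fetch_and_add(val, 1);   /* compiled to bpf_xadd */
            return 0;
        }
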
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>