1. 02 Feb 2019, 1 commit
    • bpf: introduce bpf_spin_lock · d83525ca
      Committed by Alexei Starovoitov
      Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
      a bpf program serialize access to other variables.
      
      Example:
      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock;
      };
      struct hash_elem *val = bpf_map_lookup_elem(&hash_map, &key);
      if (val) {
          bpf_spin_lock(&val->lock);
          val->cnt++;
          bpf_spin_unlock(&val->lock);
      }
      
      Restrictions and safety checks:
      - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
      - BTF description of the map is mandatory for safety analysis.
      - bpf program can take one bpf_spin_lock at a time, since two or more can
        cause deadlocks.
      - only one 'struct bpf_spin_lock' is allowed per map element.
        It drastically simplifies the implementation yet allows a bpf program to
        use any number of bpf_spin_locks.
      - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
      - bpf program must bpf_spin_unlock() before return.
      - bpf program can access 'struct bpf_spin_lock' only via
        bpf_spin_lock()/bpf_spin_unlock() helpers.
      - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
      - to use the bpf_spin_lock() helper, the BTF description of the map value
        must be a struct with a 'struct bpf_spin_lock anyname;' field at the top
        level (see the map sketch after this list). A lock nested inside another
        struct is not allowed.
      - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
      - syscall map_update and program map_update do not update bpf_spin_lock field.
      - bpf_spin_lock cannot be on the stack or inside networking packet.
        bpf_spin_lock can only be inside HASH or ARRAY map value.
      - bpf_spin_lock is available to root only and to all program types.
      - bpf_spin_lock is not allowed in inner maps of map-in-map.
      - ld_abs is not allowed inside spin_lock-ed region.
      - tracing progs and socket filter progs cannot use bpf_spin_lock due to
        insufficient preemption checks
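
      As an illustration of these BTF requirements, below is a minimal sketch of
      a matching map declaration in today's libbpf BTF-defined map style (which
      postdates this commit; the map name 'counters' and the sizes are
      hypothetical):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct hash_elem {
          int cnt;
          struct bpf_spin_lock lock; /* top-level field of the map value */
      };

      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, int);
          /* __type() emits the BTF description the verifier relies on */
          __type(value, struct hash_elem);
      } counters SEC(".maps");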
      
      Implementation details:
      - cgroup-bpf class of programs can nest with xdp/tc programs.
        Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
        Other solutions to avoid nested bpf_spin_lock are possible.
        Like making sure that all networking progs run with softirq disabled.
        spin_lock_irqsave is the simplest and doesn't add overhead to the
        programs that don't use it.
      - arch_spinlock_t is used when it is implemented as queued_spin_lock
      - archs can force their own arch_spinlock_t
      - on architectures where queued_spin_lock is not available and
        sizeof(arch_spinlock_t) != sizeof(__u32), a trivial lock is used.
      - presence of bpf_spin_lock inside map value could have been indicated via
        extra flag during map_create, but specifying it via BTF is cleaner.
        It provides introspection for map key/value and reduces user mistakes.
      
      Next steps:
      - allow bpf_spin_lock in other map types (like cgroup local storage)
      - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
        to request that the kernel grab the bpf_spin_lock before rewriting the value.
        That will serialize access to map elements.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 26 Oct 2018, 1 commit
  3. 20 Oct 2018, 1 commit
    • bpf: add queue and stack maps · f1a2e44a
      Committed by Mauricio Vasquez B
      Queue/stack maps implement FIFO/LIFO data storage for eBPF programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      Unlike other maps, queue and stack do not use RCU for protecting map
      values; the bpf_map[peek/pop] helpers take an ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where the value of the map is
      saved. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size does not
      have to be passed as an extra argument.
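
      A minimal sketch of the BPF-program side, written in today's libbpf style
      (which postdates this commit; the map name 'port_pool' and the program are
      hypothetical):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
          __uint(type, BPF_MAP_TYPE_QUEUE);
          __uint(max_entries, 256);
          __uint(value_size, sizeof(__u32)); /* queue/stack maps have no key */
      } port_pool SEC(".maps");

      SEC("xdp")
      int use_pool(struct xdp_md *ctx)
      {
          __u32 port;

          /* pop copies the value out (ARG_PTR_TO_UNINIT_MAP_VALUE) */
          if (bpf_map_pop_elem(&port_pool, &port))
              return XDP_PASS; /* pool empty */

          /* ... use 'port', then return it to the pool ... */
          bpf_map_push_elem(&port_pool, &port, 0);
          return XDP_PASS;
      }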
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT; however, we foresee
      other use cases, like for example saving the last N kernel events in a map
      and then analysing them from userspace.
      Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 01 Oct 2018, 3 commits
    • bpf: introduce per-cpu cgroup local storage · b741f163
      Committed by Roman Gushchin
      This commit introduces per-cpu cgroup local storage.
      
      Per-cpu cgroup local storage is very similar to simple cgroup storage
      (let's call it shared), except all the data is per-cpu.
      
      The main goal of the per-cpu variant is to implement super fast
      counters (e.g. packet counters), which require neither lookups
      nor atomic operations.
      
      From userspace's point of view, accessing a per-cpu cgroup storage
      is similar to other per-cpu map types (e.g. per-cpu hashmaps and
      arrays).
      
      Writing to a per-cpu cgroup storage is not atomic, but is performed
      by copying longs, so some minimal atomicity is guaranteed, exactly
      as with other per-cpu maps.
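
      A minimal sketch (in today's libbpf style, which postdates this commit;
      names are hypothetical) of a per-cpu cgroup storage used as a fast packet
      counter:

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
          __type(key, struct bpf_cgroup_storage_key);
          __type(value, __u64);
      } pkt_cnt SEC(".maps");

      SEC("cgroup_skb/egress")
      int count_pkts(struct __sk_buff *skb)
      {
          /* flags must be 0; returns this CPU's slot, so no atomics needed */
          __u64 *cnt = bpf_get_local_storage(&pkt_cnt, 0);

          (*cnt)++;
          return 1; /* allow the packet */
      }
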
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: rework cgroup storage pointer passing · f294b37e
      Committed by Roman Gushchin
      To simplify the following introduction of per-cpu cgroup storage,
      let's slightly rework the mechanism of passing a pointer to a cgroup
      storage into bpf_get_local_storage(). Let's save a pointer
      to the corresponding bpf_cgroup_storage structure, instead of
      a pointer to the actual buffer.
      
      It will help us to handle per-cpu storage later, which has
      a different way of accessing the actual data.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: extend cgroup bpf core to allow multiple cgroup storage types · 8bad74f9
      Committed by Roman Gushchin
      In order to introduce per-cpu cgroup storage, let's generalize
      bpf cgroup core to support multiple cgroup storage types.
      Potentially, per-node cgroup storage can be added later.
      
      This commit is mostly a formal change that replaces the
      cgroup_storage pointer with an array of cgroup_storage pointers.
      It doesn't actually introduce a new storage type;
      that will be done later.
      
      Each bpf program is now able to have one cgroup storage of each type.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. 03 Aug 2018, 1 commit
  6. 04 Jun 2018, 1 commit
    • bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Committed by Yonghong Song
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools currently either
      filter based on pid or operate system-wide.
      
      Containers have been used quite extensively in industry, and
      cgroups are often used together with them to provide resource
      isolation and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which returns the id of the cgroup within which the current task
      is running.
      
      A later patch will provide an example to show that
      userspace can get the same cgroup id, so it can
      configure a filter or policy in the bpf program based on
      the task's cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
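
      A minimal sketch of such filtering (in today's libbpf style, which
      postdates this commit; names are hypothetical, and the target cgroup id
      is assumed to be configured from userspace via an array map):

      #include <linux/bpf.h>
      #include <linux/ptrace.h>
      #include <bpf/bpf_helpers.h>

      struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64); /* target cgroup id, set from userspace */
      } filter SEC(".maps");

      SEC("kprobe/vfs_write")
      int trace_write(struct pt_regs *ctx)
      {
          __u32 zero = 0;
          __u64 *target = bpf_map_lookup_elem(&filter, &zero);

          if (!target || bpf_get_current_cgroup_id() != *target)
              return 0; /* not the container we care about */

          /* ... count or record the event ... */
          return 0;
      }
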
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. 10 Jan 2017, 1 commit
  8. 23 Oct 2016, 1 commit
  9. 21 Sep 2016, 1 commit
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      Committed by Daniel Borkmann
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both read and
      write, because programs would otherwise be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can just pull it in once with the new helper, retest and eventually
      access it, as sketched below.
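
      A minimal sketch of that pull-and-retest pattern for a tc program (the
      program and the length 'need' are hypothetical):

      #include <linux/bpf.h>
      #include <linux/pkt_cls.h>
      #include <bpf/bpf_helpers.h>

      SEC("classifier")
      int touch_payload(struct __sk_buff *skb)
      {
          const int need = 64; /* bytes we want to access directly */
          void *data, *data_end;

          /* pulls non-linear data into the head and unclones the skb */
          if (bpf_skb_pull_data(skb, need))
              return TC_ACT_OK;

          /* the helper invalidates prior checks: reload and retest bounds */
          data = (void *)(long)skb->data;
          data_end = (void *)(long)skb->data_end;
          if (data + need > data_end)
              return TC_ACT_OK;

          ((__u8 *)data)[0] ^= 1; /* direct packet write */
          return TC_ACT_OK;
      }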
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and has read access for the most parts with,
      for example, drop return codes, then such write action can be delegated
      to a tail called program for mitigating this cost of potential uncloning
      to a late point in time where it would have been paid similarly with the
      bpf_skb_store_bytes() as well. The advantage of direct write is that the
      writes are inlined, whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() even for small sizes; in
      addition, the cost of the helper call itself with its sanity checks is
      avoided. Plus, when
      direct read is already used, we don't need to cache or perform rechecks
      on the data boundaries (due to verifier invalidating previous checks for
      helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed for
      this are those that are part of bpf_helper_changes_skb_data(), since they
      can change the underlying data, and those that write into memory, as this
      could happen for packet-typed args when still cloned. The bpf_csum_update()
      helper accommodates the fact that we need to fix up checksum_complete when
      using direct write instead of bpf_skb_store_bytes(), meaning the programs
      can use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 10 Sep 2016, 2 commits
    • bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
      Committed by Daniel Borkmann
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, makes future helpers easier to
      write, avoids getting the casting mess wrong, and allows for extending all
      helpers at once (f.e. build time checks, etc). It also makes it easier to
      verify in code reviews that unused registers are not accidentally
      instrumented in the code, which would break compatibility with existing
      programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with
      some fundamental differences; for example, to generate the actual helper
      function that carries all u64 regs, we need to fill unused regs so that we
      always end up with 5 u64 regs as arguments.
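
      For reference, a simplified sketch of what a converted helper looks like
      in-tree (kernel-internal code, not a standalone program):

      BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
      {
          WARN_ON_ONCE(!rcu_read_lock_held());
          return (unsigned long) map->ops->map_lookup_elem(map, key);
      }

      const struct bpf_func_proto bpf_map_lookup_elem_proto = {
          .func      = bpf_map_lookup_elem,
          .gpl_only  = false,
          .ret_type  = RET_PTR_TO_MAP_VALUE_OR_NULL,
          .arg1_type = ARG_CONST_MAP_PTR,
          .arg2_type = ARG_PTR_TO_MAP_KEY,
      };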
      
      I reviewed several generated BPF_CALL_<n>() variants (0-5 arguments) of the
      .i results and they all look as expected. No sparse issue was spotted. We
      also let this sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: minor cleanups in helpers · 6088b582
      Committed by Daniel Borkmann
      Some minor misc cleanups: f.e. use sizeof(__u32) instead of hardcoding it;
      in __bpf_skb_max_len(), I missed that we always have skb->dev valid
      anyway, so we can drop the unneeded test for dev; a few more other
      misc bits are addressed here.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 30 Jun 2016, 1 commit
  12. 15 Apr 2016, 1 commit
  13. 10 Mar 2016, 1 commit
  14. 08 Oct 2015, 1 commit
    • bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs · 3ad00405
      Committed by Daniel Borkmann
      While recently arguing in a seccomp discussion that raw prandom_u32()
      access shouldn't be exposed to unprivileged user space, I forgot the
      fact that the SKF_AD_RANDOM extension has actually already done so for
      some time in cBPF via commit 4cd3675e ("filter: added BPF random opcode").
      
      Since prandom_u32() is used in a lot of critical networking code,
      let's be more conservative and split their states. Furthermore, consolidate
      the eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
      bpf_get_prandom_u32() was only accessible to privileged users, but
      should that change one day, we also don't want to leak raw sequences
      through things like eBPF maps.
      
      One thought was also to have per-bpf_prog states of their own, but due to
      ABI reasons this is not easily possible, i.e. the program code currently
      cannot access bpf_prog itself, and copying the rnd_state to/from the
      stack scratch space whenever a program uses the prng seems not really
      worth the trouble and too hacky. If needed, taus113 could in such
      cases be implemented within eBPF, using a map entry to keep the state
      space, or get_random_bytes() could become a second helper in cases where
      performance is not critical.
      
      Both sides can trigger a one-time late init via prandom_init_once() on
      the shared state. Performance-wise, there should even be a tiny gain
      as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
      inside the BPF core since kernels could have a NET-less config as well.
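
      A sketch of the resulting split-state pattern (close to the actual
      bpf_user_rnd_u32() added here; kernel-internal code):

      static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);

      u32 bpf_user_rnd_u32(void)
      {
          struct rnd_state *state;
          u32 res;

          /* one-time late init of the dedicated state */
          prandom_init_once(&bpf_user_rnd_state);

          state = &get_cpu_var(bpf_user_rnd_state);
          res = prandom_u32_state(state);
          put_cpu_var(bpf_user_rnd_state);

          return res;
      }
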
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Chema Gonzalez <chema@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 16 Jun 2015, 1 commit
    • bpf: introduce current->pid, tgid, uid, gid, comm accessors · ffeedafb
      Committed by Alexei Starovoitov
      eBPF programs attached to kprobes need to filter based on
      current->pid, uid and other fields, so introduce helper functions:
      
      u64 bpf_get_current_pid_tgid(void)
      Return: current->tgid << 32 | current->pid
      
      u64 bpf_get_current_uid_gid(void)
      Return: current_gid << 32 | current_uid
      
      bpf_get_current_comm(char *buf, int size_of_buf)
      stores current->comm into buf
      
      They can be used from programs attached to TC as well, to classify packets
      based on current task fields.
      
      Update the tracex2 example to print a histogram of write syscalls for each
      process instead of one aggregated for all.
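
      A minimal sketch of per-task filtering with the new accessors (in today's
      libbpf style, which postdates this commit; the program and its root-only
      policy are hypothetical):

      #include <linux/bpf.h>
      #include <linux/ptrace.h>
      #include <bpf/bpf_helpers.h>

      SEC("kprobe/vfs_write")
      int on_write(struct pt_regs *ctx)
      {
          /* low 32 bits are pid and uid respectively (see layout above) */
          __u32 pid = (__u32)bpf_get_current_pid_tgid();
          __u32 uid = (__u32)bpf_get_current_uid_gid();
          char comm[16];

          bpf_get_current_comm(&comm, sizeof(comm));

          if (uid != 0)
              return 0; /* hypothetical policy: trace root only */

          bpf_printk("pid %d comm %s", pid, comm);
          return 0;
      }
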
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 01 Jun 2015, 2 commits
  17. 16 Mar 2015, 2 commits
  18. 02 Mar 2015, 1 commit
  19. 19 Nov 2014, 1 commit