  1. 24 Jan 2017, 1 commit
  2. 21 Jan 2017, 1 commit
    • bpf: add bpf_probe_read_str helper · a5e8c070
      Authored by Gianluca Borello
      Provide a simple helper with the same semantics of strncpy_from_unsafe():
      
      int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)
      
      This gives more flexibility to a bpf program. A typical use case is
      intercepting a file name during sys_open(). The current approach is:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	bpf_probe_read(buf, sizeof(buf), ctx->di);
      
      	/* consume buf */
      }
      
      This is suboptimal because the size of the string needs to be estimated
      at compile time, causing more memory to be copied than is often necessary,
      and it can become more problematic if further processing on buf is done,
      for example by pushing it to userspace via bpf_perf_event_output(),
      since the real length of the string is unknown and the entire buffer
      must be copied (and defining an unrolled strnlen() inside the bpf
      program is a very inefficient and unfeasible approach).
      
      With the new helper, the code can easily operate on the actual string
      length rather than the buffer size:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);
      
      	/* consume buf, for example push it to userspace via
      	 * bpf_perf_event_output(), but this time we can use
      	 * res (the string length) as event size, after checking
      	 * its boundaries.
      	 */
      }
      
      Another useful use case is parsing individual process arguments or
      individual environment variables by navigating current->mm->arg_start and
      current->mm->env_start: using this helper and the return value, one can
      quickly iterate at the right offset of the memory area.
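
      As a rough sketch of that iteration (not part of this commit; MAX_ARGS,
      ARGLEN and the arg_start/arg_end values read beforehand from current->mm
      are hypothetical placeholders, and the loop is unrolled since the
      verifier does not allow real loops):

      	unsigned long ptr = arg_start;
      	int i;

      #pragma unroll
      	for (i = 0; i < MAX_ARGS && ptr < arg_end; i++) {
      		char arg[ARGLEN];
      		int len = bpf_probe_read_str(arg, sizeof(arg), (void *)ptr);

      		if (len <= 0)
      			break;
      		/* consume arg, f.e. push it via bpf_perf_event_output() */
      		ptr += len; /* returned length includes the trailing NUL */
      	}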
      
      The code changes simply leverage the existing strncpy_from_unsafe()
      kernel function, which is safe to call from a bpf program as it is
      already used by bpf_trace_printk().
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5e8c070
  3. 09 Dec 2016, 1 commit
  4. 03 Dec 2016, 2 commits
  5. 02 Dec 2016, 1 commit
    • bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Authored by Thomas Graf
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevents modification and
         possible invalidation of already validated packet headers on receive
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as prepending an L2 header via a newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following error codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.
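
      As a hedged illustration (not from this commit), a minimal program for
      the lwtunnel_xmit() hook could look as follows; the "lwt_xmit" section
      name is only the loader convention assumed here:

      SEC("lwt_xmit")
      int do_xmit(struct __sk_buff *skb)
      {
      	/* all LWT hooks may inspect the skb; only the xmit hook may
      	 * additionally mangle data or call bpf_skb_change_head()
      	 */
      	if (skb->len > 1500)
      		return BPF_DROP;	/* drop skb, signal EPERM */

      	return BPF_OK;			/* continue with the nexthop */
      }
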
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a0af8fd
  6. 26 Nov 2016, 2 commits
    • bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands · f4324551
      Authored by Daniel Mack
      Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
      BPF_PROG_DETACH which allow attaching and detaching eBPF programs
      to a target.
      
      On the API level, the target could be anything that has an fd in
      userspace, hence the field in union bpf_attr is called 'target_fd'.
      
      When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
      expected to be a valid file descriptor of a cgroup v2 directory which
      has the bpf controller enabled. These are the only use-cases
      implemented by this patch at this point, but more can be added.
      
      If a program of the given type already exists in the given cgroup,
      the program is swapped atomically, so userspace does not have to drop
      an existing program first before installing a new one, which would
      otherwise leave a gap in which no program is attached.
      
      For more information on the propagation logic to subcgroups, please
      refer to the bpf cgroup controller implementation.
      
      The API is guarded by CAP_NET_ADMIN.
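
      A hedged userspace sketch of attaching an already loaded program to a
      cgroup via the new command (the attach type spelling follows the final
      uapi enum BPF_CGROUP_INET_INGRESS; prog_fd and the cgroup path are
      placeholders):

      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static int attach_to_cgroup(int prog_fd, const char *cg_path)
      {
      	union bpf_attr attr;
      	int cg_fd, ret;

      	cg_fd = open(cg_path, O_RDONLY | O_DIRECTORY);
      	if (cg_fd < 0)
      		return -1;

      	memset(&attr, 0, sizeof(attr));
      	attr.target_fd = cg_fd;		/* cgroup v2 directory fd */
      	attr.attach_bpf_fd = prog_fd;	/* previously loaded program */
      	attr.attach_type = BPF_CGROUP_INET_INGRESS;

      	ret = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
      	close(cg_fd);
      	return ret;
      }
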
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4324551
    • bpf: add new prog type for cgroup socket filtering · 0e33661d
      Authored by Daniel Mack
      This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
      it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
      bpf_skb_load_bytes() helper.
      
      Programs of this type will be attached to cgroups for network filtering
      and accounting.
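
      A minimal sketch of what such a program might look like (illustrative
      only; the "cgroup/skb" section name and the 1/0 pass/drop return
      convention are assumptions in the usual samples style):

      SEC("cgroup/skb")
      int cg_filter(struct __sk_buff *skb)
      {
      	__u8 proto;

      	/* LD_ABS/LD_IND are unavailable here, use the helper instead;
      	 * offset 9 is the protocol field of an IPv4 header, assuming
      	 * data starts at L3 for this program type
      	 */
      	if (bpf_skb_load_bytes(skb, 9, &proto, sizeof(proto)) < 0)
      		return 1;	/* let the packet pass on error */

      	/* per-protocol accounting could update a map here */
      	return 1;		/* non-zero: allow */
      }
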
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e33661d
  7. 16 Nov 2016, 2 commits
  8. 30 Oct 2016, 1 commit
  9. 23 Oct 2016, 1 commit
  10. 23 Sep 2016, 1 commit
  11. 21 Sep 2016, 1 commit
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      Authored by Daniel Borkmann
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both read and
      write, because programs would otherwise be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can just pull them in once with the new helper, retest and eventually
      access them.
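
      For illustration, the pull-and-retest pattern could look roughly like
      this sketch (len being a constant number of bytes the program needs to
      see linearly):

      	void *data = (void *)(long)skb->data;
      	void *data_end = (void *)(long)skb->data_end;

      	if (data + len > data_end) {
      		/* bytes may sit in non-linear parts, pull them in once */
      		if (bpf_skb_pull_data(skb, len) < 0)
      			return TC_ACT_SHOT;
      		/* helper may move skb->data/data_end, reload and retest */
      		data = (void *)(long)skb->data;
      		data_end = (void *)(long)skb->data_end;
      		if (data + len > data_end)
      			return TC_ACT_SHOT;
      	}
      	/* direct reads/writes within [data, data + len) are now valid */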
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and have read access for the most part with,
      for example, drop return codes, then such write actions can be delegated
      to a tail-called program, deferring the cost of potential uncloning to a
      late point in time where it would have been paid similarly with
      bpf_skb_store_bytes() as well. The advantage of direct write is that the
      writes are inlined, whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() even for small sizes; in
      addition, the cost of the helper call itself with its sanity checks is
      avoided. Plus, when
      direct read is already used, we don't need to cache or perform rechecks
      on the data boundaries (due to verifier invalidating previous checks for
      helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed for
      this are those part of bpf_helper_changes_skb_data() since they can change
      underlying data, and those that write into memory as this could happen for
      packet typed args when still cloned. The bpf_csum_update() helper
      accommodates the fact that we need to fix up checksum_complete when using
      direct write instead of bpf_skb_store_bytes(), meaning the programs can
      use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory.
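
      A hedged sketch of the csum_add()/csum_sub() equivalents mentioned
      above, mirroring the kernel's folding of the carry (the __always_inline
      wrappers are assumptions, not part of this commit):

      static __always_inline __wsum csum_add(__wsum csum, __wsum addend)
      {
      	__u32 res = (__u32)csum + (__u32)addend;

      	/* fold the carry back in, as the kernel's csum_add() does */
      	return (__wsum)(res + (res < (__u32)addend));
      }

      static __always_inline __wsum csum_sub(__wsum csum, __wsum addend)
      {
      	return csum_add(csum, ~addend);
      }

      The resulting delta, f.e. obtained via bpf_csum_diff() over the
      rewritten words, can then be passed to bpf_csum_update(skb, delta) to
      keep checksum_complete consistent after a direct write.
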
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36bbef52
  12. 03 Sep 2016, 1 commit
    • bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type · 0515e599
      Authored by Alexei Starovoitov
      Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
      HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
      respectively, in uapi/linux/perf_event.h).
      
      The program visible context meta structure is
      struct bpf_perf_event_data {
          struct pt_regs regs;
           __u64 sample_period;
      };
      which is accessible directly from the program:
      int bpf_prog(struct bpf_perf_event_data *ctx)
      {
        ... ctx->sample_period ...
        ... ctx->regs.ip ...
      }
      
      The bpf verifier rewrites the accesses into kernel internal
      struct bpf_perf_event_data_kern which allows changing
      struct perf_sample_data without affecting bpf programs.
      New fields can be added to the end of struct bpf_perf_event_data
      in the future.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0515e599
  13. 19 Aug 2016, 1 commit
    • bpf: add bpf_skb_change_tail helper · 5293efe6
      Authored by Daniel Borkmann
      This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
      basic idea is to expand or shrink the skb in a controlled manner. The
      eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
      bpf_lX_csum_replace() and others rather than passing a raw buffer for
      writing here.
      
      bpf_skb_change_tail() is really a slow path helper and intended for
      replies with f.e. ICMP control messages. The concept is similar to that
      of other helpers like bpf_skb_change_proto(): keep the helper free of
      protocol specifics and let the BPF program mangle the remaining parts.
      A flags field has been added and is reserved for now, should we extend
      the helper in the future.
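
      As a rough illustration (not part of this commit; new_len, reply,
      reply_len and payload_off are placeholders), growing a reply skb and
      then filling the tail could be sketched as:

      	/* grow (or shrink) the packet to new_len; flags are reserved (0) */
      	if (bpf_skb_change_tail(skb, new_len, 0) < 0)
      		return TC_ACT_SHOT;

      	/* rewrite the tail with the reply payload via helpers */
      	if (bpf_skb_store_bytes(skb, payload_off, reply, reply_len, 0) < 0)
      		return TC_ACT_SHOT;
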
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5293efe6
  14. 13 Aug 2016, 2 commits
  15. 26 Jul 2016, 1 commit
    • bpf: Add bpf_probe_write_user BPF helper to be called in tracers · 96ae5227
      Authored by Sargun Dhillon
      This allows user memory to be written to during the course of a kprobe.
      It shouldn't be used to implement any kind of security mechanism
      because of TOC-TOU attacks, but rather to debug, divert, and
      manipulate execution of semi-cooperative processes.
      
      Although it uses probe_kernel_write, we limit the address space
      the probe can write into by checking the space with access_ok.
      We do this as opposed to calling copy_to_user directly, in order
      to avoid sleeping. In addition, we ensure the thread's current fs
      / segment is USER_DS and that the thread is neither exiting nor a
      kernel thread.
      
      Given this feature is meant for experiments, and it carries a risk of
      crashing the system and running programs, we print a warning when a
      proglet that attempts to use this helper is installed, along with the
      pid and process name.
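
      A minimal hedged sketch of calling it from a kprobe (purely
      illustrative; as in the examples above, ctx->si is the x86-64 register
      holding the user buffer pointer of sys_write()):

      SEC("kprobe/sys_write")
      int divert_write(struct pt_regs *ctx)
      {
      	char patch[] = "X";
      	void *ubuf = (void *)ctx->si;	/* second syscall argument */

      	/* best-effort overwrite of the first byte of the user buffer */
      	bpf_probe_write_user(ubuf, patch, sizeof(patch) - 1);
      	return 0;
      }
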
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96ae5227
  16. 20 Jul 2016, 2 commits
  17. 16 Jul 2016, 1 commit
    • bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Authored by Daniel Borkmann
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure BPF
      verifier can do sanity checks on it during verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
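
      In program code this roughly amounts to the following sketch (assuming
      the upper 32 bits of the flags word carry the number of skb bytes to
      append; struct event_meta and the events map are placeholders):

      	struct event_meta meta = { .ifindex = skb->ifindex };
      	__u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);

      	/* meta forms the linear part, skb->len bytes are appended from
      	 * the skb ctx itself, without an extra copy onto the bpf stack
      	 */
      	bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
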
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      555c8a86
  18. 09 Jul 2016, 1 commit
  19. 05 Jul 2016, 1 commit
  20. 02 Jul 2016, 2 commits
  21. 30 Jun 2016, 3 commits
    • bpf: add bpf_skb_change_type helper · d2485c42
      Authored by Daniel Borkmann
      This work adds a helper for changing skb->pkt_type in a controlled way.
      We only allow a subset of possible values and can extend that in future
      should other use cases come up. Doing this as a helper has the advantage
      that errors can be handled gracefully and the helper thus kept extensible.
      
      It's a write counterpart to pkt_type member we can already read from
      struct __sk_buff context. Major use case is to change incoming skbs to
      PACKET_HOST in a programmatic way instead of having to recirculate via
      redirect(..., BPF_F_INGRESS), for example.
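
      In practice that boils down to a one-liner like the following sketch:

      	/* mark the skb for local delivery instead of recirculating it */
      	if (bpf_skb_change_type(skb, PACKET_HOST) < 0)
      		return TC_ACT_SHOT;
      	return TC_ACT_OK;
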
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2485c42
    • bpf: add bpf_skb_change_proto helper · 6578171a
      Authored by Daniel Borkmann
      This patch adds a minimal helper for doing the groundwork of changing
      the skb->protocol in a controlled way. Currently supported is v4 to
      v6 and vice versa transitions, which allows f.e. for a minimal, static
      nat64 implementation where applications in containers that still
      require IPv4 can be transparently operated in an IPv6-only environment.
      For example, the host-facing veth of the container can transparently do
      the transitions in a programmatic way with the help of clsact qdisc
      and cls_bpf.
      
      Idea is to separate concerns for keeping complexity of the helper
      lower, which means that the programs utilize bpf_skb_change_proto(),
      bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done,
      instead of doing everything in a single helper (and thus partially
      duplicating helper functionality). Also, bpf_skb_change_proto()
      shouldn't need to deal with raw packet data as this is done by other
      helpers.
      
      bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to
      operate on a private one, push or pop additionally required header
      space and migrate the gso/gro meta data from the shared info. We do
      mark the gso type as dodgy so that headers are checked and segs
      recalculated by the gso/gro engine. The gso_size target is adapted
      as well. The flags argument added is currently reserved and can be
      used for future extensions.
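
      A hedged sketch of the v4-to-v6 direction combining the helpers as
      described (the translated ip6h header, its offset nh_off and the
      checksum fixups are placeholders left to the program):

      	/* switch skb->protocol and make room for the larger IPv6 header */
      	if (bpf_skb_change_proto(skb, __constant_htons(ETH_P_IPV6), 0) < 0)
      		return TC_ACT_SHOT;

      	/* write the translated IPv6 header built by the program */
      	if (bpf_skb_store_bytes(skb, nh_off, &ip6h, sizeof(ip6h), 0) < 0)
      		return TC_ACT_SHOT;
      	/* ... bpf_l4_csum_replace() for the pseudo-header delta ... */
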
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6578171a
    • bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_read · 6816a7ff
      Authored by Daniel Borkmann
      Follow-up commit to 1e33759c ("bpf, trace: add BPF_F_CURRENT_CPU
      flag for bpf_perf_event_output") to add the same functionality into
      bpf_perf_event_read() helper. The split of index into flags and index
      component is also safe here, since such large maps are rejected during
      map allocation time.
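
      Usage then mirrors the output helper, f.e. (sketch; counters being a
      placeholder BPF_MAP_TYPE_PERF_EVENT_ARRAY):

      	/* read the counter of the event installed for the current cpu */
      	__u64 cnt = bpf_perf_event_read(&counters, BPF_F_CURRENT_CPU);
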
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6816a7ff
  22. 07 May 2016, 1 commit
    • bpf: direct packet access · 969bf05e
      Authored by Alexei Starovoitov
      Extended BPF carried over two instructions from classic to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do length check for every access.
      When BPF is processing 20M packets per second, a single LD_ABS after JIT
      consumes 3% cpu. Hence the need to optimize it further by amortizing
      the cost of 'off < skb_headlen' over multiple packet accesses.
      One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
      with similar usage as skb_header_pointer().
      The kernel part for interpreter and x64 JIT was implemented in [1], but such
      new insns behave like old ld_abs and abort the program with 'return 0' if
      access is beyond linear data. Such hidden control flow is hard to work
      around, plus changing JITs and rolling out new llvm is inconvenient.
      
      Therefore allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids introduction of new instructions. llvm stays
      the same and all JITs stay the same, but verifier has to work extra hard
      to prove safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can
      check the alignment. Complex packet parsers where the packet pointer is
      adjusted incrementally cannot be tracked for alignment, so allow byte
      access in such cases, and misaligned access on architectures that define
      efficient_unaligned_access.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      969bf05e
  23. 20 Apr 2016, 1 commit
  24. 08 Apr 2016, 1 commit
    • perf, bpf: allow bpf programs attach to tracepoints · 98b5c2c6
      Authored by Alexei Starovoitov
      introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached
      to the perf tracepoint handler, which will copy the arguments into
      the per-cpu buffer and pass it to the bpf program as its first argument.
      The layout of the fields can be discovered by doing
      'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
      prior to the compilation of the program, with the exception that the
      first 8 bytes are reserved and not accessible to the program. This area
      is used to store the pointer to 'struct pt_regs' which some of the bpf
      helpers will use:
      +---------+
      | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
      +---------+
      | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
      +---------+
      | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
      +---------+
      
      Note that all of the fields are already dumped to user space via the
      perf ring buffer and broken applications access them directly without
      consulting tracepoint/format. The same rule applies here: static
      tracepoint fields should only be accessed in a format defined in
      tracepoint/format. The order of fields and field sizes is not an ABI.
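
      A hedged sketch of how a program consumes such a context: the struct
      below hand-mirrors the sched_switch format file, the field layout must
      be taken from tracepoint/format on the running kernel, and the
      "tracepoint/" section prefix is only a loader convention:

      struct sched_switch_args {
      	__u64 pad;			/* hidden 'struct pt_regs *' slot */
      	char prev_comm[16];
      	int prev_pid;
      	int prev_prio;
      	long long prev_state;
      	char next_comm[16];
      	int next_pid;
      	int next_prio;
      };

      SEC("tracepoint/sched/sched_switch")
      int trace_switch(struct sched_switch_args *ctx)
      {
      	char fmt[] = "switch to %d\n";

      	bpf_trace_printk(fmt, sizeof(fmt), ctx->next_pid);
      	return 0;
      }
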
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98b5c2c6
  25. 31 Mar 2016, 1 commit
    • bpf: make padding in bpf_tunnel_key explicit · c0e760c9
      Authored by Daniel Borkmann
      Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
      and tunnel_label members explicit. No issue has been observed, and
      gcc/llvm does padding for the old struct already, where tunnel_label
      was not yet present, so the current code works, but since it's part
      of uapi, make sure we don't introduce holes in structs.
      
      Therefore, add tunnel_ext that we can use generically in future
      (f.e. to flag OAM messages for backends, etc). Also add the offset
      to the compat tests to be sure, should some compilers not pad the
      tail of the old version of bpf_tunnel_key.
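
      For reference, the resulting layout roughly looks as follows (a sketch
      reconstructed from the description; include/uapi/linux/bpf.h is the
      authoritative definition):

      struct bpf_tunnel_key {
      	__u32 tunnel_id;
      	union {
      		__u32 remote_ipv4;
      		__u32 remote_ipv6[4];
      	};
      	__u8 tunnel_tos;
      	__u8 tunnel_ttl;
      	__u16 tunnel_ext;	/* was implicit padding before tunnel_label */
      	__u32 tunnel_label;
      };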
      
      Fixes: 4018ab18 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0e760c9
  26. 12 Mar 2016, 1 commit
  27. 09 Mar 2016, 4 commits
    • bpf: pre-allocate hash map elements · 6c905981
      Authored by Alexei Starovoitov
      If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following deadlock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock),
      which deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      It turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is a very common pattern used
      in many of the iovisor/bcc tools, so there is an additional benefit to
      pre-allocation, since such use cases are much faster.
      
      Since all hash map elements are now pre-allocated, we can remove the
      atomic increment of htab->count and save a few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simplified and customized
      for the bpf use case. bt w/smp_align uses a cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks contiguously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist.
      
      hlist+spinlock is the simplest free list with a single spinlock.
      As expected, it has very bad scaling in SMP.
      
      kmalloc is the existing implementation, which is still available via
      the BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu,
      and in the 8 cpu setup it's 3 times slower than pre-allocation with
      pcpu_freelist, but it saves memory, so in cases where map->max_entries
      can be large and the number of map updates/deletes per second is low,
      it may make sense to use it.
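
      Opting out of pre-allocation then is just a map flag, f.e. in the
      samples/bpf map definition style (a sketch, assuming a loader that
      passes map_flags through):

      struct bpf_map_def SEC("maps") my_hash = {
      	.type        = BPF_MAP_TYPE_HASH,
      	.key_size    = sizeof(__u32),
      	.value_size  = sizeof(__u64),
      	.max_entries = 1024 * 1024,
      	.map_flags   = BPF_F_NO_PREALLOC,	/* keep the kmalloc path */
      };
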
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c905981
    • bpf: support for access to tunnel options · 14ca0751
      Authored by Daniel Borkmann
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for the vxlan case its GBP extension. That is, setting a tunnel key
      in the geneve case only makes sense if we can also read/write TLVs into
      it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. Had we integrated this into
      bpf_skb_{set,get}_tunnel_key() nevertheless, we would be very limited
      with the remaining helper arguments, so keeping compatibility on structs
      in the case of passing in a flat buffer would get more cumbersome.
      Separating both also allows
      for more flexibility and future extensibility, f.e. options could be fed
      directly from a map, etc.
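
      For the vxlan/GBP case mentioned above, usage could be sketched roughly
      as follows (the key values and the 32-bit GBP option word are
      placeholders):

      	struct bpf_tunnel_key key = {};
      	__u32 gbp = 0x800;		/* example group policy ID */

      	key.remote_ipv4 = 0xac100164;	/* 172.16.1.100 */
      	key.tunnel_id = 42;

      	if (bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0) < 0 ||
      	    bpf_skb_set_tunnel_opt(skb, &gbp, sizeof(gbp)) < 0)
      		return TC_ACT_SHOT;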
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of the TUNNEL_GENEVE_OPT flag. This makes it more consistent
      with vxlan's xmit path and avoids having to specify a protocol flag in
      the API on xmit, so it can be protocol agnostic. Having
      info->options_len is all the information that is needed. Tested with
      vxlan and geneve.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14ca0751
    • bpf: allow to propagate df in bpf_skb_set_tunnel_key · 22080870
      Authored by Daniel Borkmann
      The df flag was added by 9a628224 ("ip_tunnel: Add dont fragment
      flag."); allow it to be fed into tunneling facilities (currently
      supported on TX by vxlan, geneve and gre) as a hint from eBPF's
      bpf_skb_set_tunnel_key() helper.
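
      From the program side this is just another flag for the existing
      helper, f.e. (sketch; key is a filled-in struct bpf_tunnel_key):

      	/* request DF to be set on the outer IPv4 header on transmit */
      	bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_DONT_FRAGMENT);
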
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22080870
    • bpf: add flags to bpf_skb_store_bytes for clearing hash · 8afd54c8
      Authored by Daniel Borkmann
      When overwriting parts of the packet with bpf_skb_store_bytes() that
      were fed previously into skb->hash calculation, we should clear the
      current hash with skb_clear_hash(), so that a next skb_get_hash() call
      can determine the correct hash related to this skb.
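
      From a program's perspective this looks like (sketch; off and val are
      placeholders for rewritten bytes that feed into flow hashing):

      	/* rewritten bytes affect the flow hash, so drop the stale one */
      	bpf_skb_store_bytes(skb, off, &val, sizeof(val),
      			    BPF_F_INVALIDATE_HASH);
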
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8afd54c8
  28. 25 Feb 2016, 1 commit
    • bpf: fix csum setting for bpf_set_tunnel_key · 2da897e5
      Authored by Daniel Borkmann
      The fix in 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly
      controlled.") changed behavior for bpf_set_tunnel_key() when in use with
      IPv6 and thus uncovered a bug where TUNNEL_CSUM needed to be set but
      wasn't. As a result, the stack dropped ingress vxlan IPv6 packets that
      had been sent via eBPF through collect metadata mode, due to the
      checksum now being zero.
      
      Since after LCO we enable the IPv4 checksum by default, make this
      analogous and only provide a flag, BPF_F_ZERO_CSUM_TX, for the user to
      turn it off in the IPv4 case.
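
      So from a program, disabling the outer UDP checksum stays an explicit,
      IPv4-only opt-out, f.e. (sketch; key is a filled-in struct
      bpf_tunnel_key):

      	/* IPv4 underlay: outer csum is on by default, turn it off here */
      	bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX);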
      
      Fixes: 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
      Fixes: c6c33454 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2da897e5
  29. 22 Feb 2016, 1 commit