1. 20 Dec 2019 (1 commit)
    • bpf: Support replacing cgroup-bpf program in MULTI mode · 7dd68b32
      Andrey Ignatov authored
      The common use-case in production is to have multiple cgroup-bpf
      programs per attach type that cover multiple use-cases. Such programs
      are attached with BPF_F_ALLOW_MULTI and can be maintained by different
      people.
      
      The order of programs usually matters; for example, imagine two egress
      programs: the first one drops packets and the second one counts packets.
      If they're swapped, the result of the counting program will be different.
      
      This brings operational challenges when updating cgroup-bpf program(s)
      attached with BPF_F_ALLOW_MULTI, since there is no way to replace a
      program:
      
      * One way to update is to detach all programs first and then attach the
        new version(s) again in the right order. This introduces an
        interruption in the work a program is doing and may not be acceptable
        (e.g. if it's an egress firewall);
      
      * Another way is to attach the new version of a program first and only
        then detach the old version. This introduces a time interval during
        which two versions of the same program are running, which may not be
        acceptable if a program is not idempotent. It also imposes an
        additional burden on program developers to make sure that two versions
        of their program can co-exist.
      
      Solve the problem by introducing a "replace" mode in the BPF_PROG_ATTACH
      command for cgroup-bpf programs attached with the BPF_F_ALLOW_MULTI
      flag. This mode is enabled by the newly introduced BPF_F_REPLACE attach
      flag and the bpf_attr.replace_bpf_fd attribute, which passes the fd of
      the old program to replace.
      
      That way a user can replace any program among those attached with the
      BPF_F_ALLOW_MULTI flag without the problems described above.
      
      Details of the new API:
      
      * If BPF_F_REPLACE is set but replace_bpf_fd doesn't hold a valid
        descriptor of a BPF program, BPF_PROG_ATTACH will return the
        corresponding error (EINVAL or EBADF).
      
      * If replace_bpf_fd holds a valid descriptor of a BPF program, but that
        program is not attached to the specified cgroup, BPF_PROG_ATTACH will
        return ENOENT.
      
      BPF_F_REPLACE is introduced to make the user intent clear, since
      replace_bpf_fd alone can't be used for this (its default value, 0, is a
      valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
      future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).
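      
      A minimal userspace sketch of the replace flow (not the patch's own
      selftest; cgroup_fd, old_prog_fd and new_prog_fd are assumed to come
      from elsewhere):
      
        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        
        static int replace_cgroup_egress_prog(int cgroup_fd, int old_prog_fd,
                                              int new_prog_fd)
        {
                union bpf_attr attr;
        
                memset(&attr, 0, sizeof(attr));
                attr.target_fd      = cgroup_fd;      /* cgroup to operate on */
                attr.attach_bpf_fd  = new_prog_fd;    /* program being attached */
                attr.attach_type    = BPF_CGROUP_INET_EGRESS;
                attr.attach_flags   = BPF_F_ALLOW_MULTI | BPF_F_REPLACE;
                attr.replace_bpf_fd = old_prog_fd;    /* program being replaced */
        
                /* errno is EINVAL/EBADF for a bad replace_bpf_fd and ENOENT
                 * if old_prog_fd is not attached to this cgroup. */
                return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }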
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
  2. 18 Nov 2019 (1 commit)
    • bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko authored
      Add the ability to memory-map the contents of a BPF array map. This is
      extremely useful for working with BPF global data from userspace
      programs. It avoids the typical bpf_map_{lookup,update}_elem operations,
      improving both performance and usability.
      
      There had to be special considerations for map freezing, to avoid having
      a writable memory view into a frozen map. To solve this issue, map
      freezing and mmap-ing now happen under a mutex:
        - if the map is already frozen, no writable mapping is allowed;
        - if the map has writable memory mappings active (accounted in
          map->writecnt), map freezing will keep failing with -EBUSY;
        - once the number of writable memory mappings drops to zero, map
          freezing can be performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For a BPF_F_MMAPABLE array, memory allocation has to be done through
      vmalloc() to be mmap()'able. We also need to make sure that the array
      data memory is page-sized and page-aligned, so we over-allocate memory
      in such a way that struct bpf_array sits at the end of a single page,
      with array->value aligned with the start of the second page. On
      deallocation we need to accommodate this memory arrangement to free the
      vmalloc()'ed memory correctly.
      
      One important consideration concerns how the memory-mapping subsystem
      functions. It provides a few optional callbacks, among them open() and
      close(). close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources,
      if necessary. open() is *almost* symmetrical: it's called for each
      memory region that is being mapped, **except** the very first one. So
      bpf_map_mmap does the initial refcnt bump, while open() handles any
      extra ones after that. Thus the number of close() calls equals the
      number of open() calls plus one.
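      
      A minimal userspace sketch under these semantics (not the selftest from
      this series): create a BPF_F_MMAPABLE array sized to one page and write
      a value through the mapping instead of bpf_map_update_elem().
      
        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        
        int main(void)
        {
                union bpf_attr attr;
                __u64 *vals;
                int map_fd;
        
                memset(&attr, 0, sizeof(attr));
                attr.map_type    = BPF_MAP_TYPE_ARRAY;
                attr.key_size    = sizeof(__u32);
                attr.value_size  = sizeof(__u64);
                attr.max_entries = 512;               /* 512 * 8B = one page */
                attr.map_flags   = BPF_F_MMAPABLE;    /* opt in to mmap() */
        
                map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
                if (map_fd < 0)
                        return 1;
        
                /* Writable mapping; would fail if the map were already frozen. */
                vals = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                            map_fd, 0);
                if (vals == MAP_FAILED)
                        return 1;
        
                vals[0] = 42;          /* direct update, no syscall round-trip */
                munmap(vals, 4096);
                return 0;
        }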
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
  3. 16 Nov 2019 (2 commits)
    • bpf: Support attaching tracing BPF program to other BPF programs · 5b92a28a
      Alexei Starovoitov authored
      Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any
      type, including their subprograms. This feature allows snooping on input
      and output packets in XDP and TC programs, including their return
      values. In order to do that the verifier needs to track types not only
      of vmlinux, but types of other BPF programs as well. The verifier also
      needs to translate uapi/linux/bpf.h types used by networking programs
      into kernel-internal BTF types used by FENTRY/FEXIT BPF programs. In
      some cases LLVM optimizations can remove arguments from BPF subprograms
      without adjusting the BTF info that the LLVM backend knows about. When
      the BTF info disagrees with the actual types that the verifier sees, the
      BPF trampoline has to fall back to a conservative mode and treat all
      arguments as u64. The FENTRY/FEXIT program can still attach to such
      subprograms, but it won't be able to recognize pointer types like
      'struct sk_buff *' and it won't be able to pass them to bpf_skb_output()
      for dumping packets to user space. The FENTRY/FEXIT program would need
      to use bpf_probe_read_kernel() instead.
      
      The BPF_PROG_LOAD command is extended with an attach_prog_fd field. When
      it is set to zero, attach_btf_id is one of the vmlinux BTF type ids.
      When attach_prog_fd points to a previously loaded BPF program,
      attach_btf_id is the BTF type id of its main function or one of its
      subprograms.
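      
      A hypothetical loader fragment (not code from this patch) showing the
      bpf_attr fields that select another BPF program, rather than vmlinux, as
      the attach target; target_prog_fd and btf_id_of_func are assumed to be
      resolved elsewhere:
      
        #include <linux/bpf.h>
        
        /* Callers still fill insns, insn_cnt, license, etc. as usual. */
        static void set_tracing_target(union bpf_attr *attr, int target_prog_fd,
                                       __u32 btf_id_of_func)
        {
                attr->prog_type            = BPF_PROG_TYPE_TRACING;
                attr->expected_attach_type = BPF_TRACE_FENTRY;
                attr->attach_prog_fd       = target_prog_fd; /* 0 = vmlinux BTF */
                attr->attach_btf_id        = btf_id_of_func; /* main func or subprog */
        }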
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
    • bpf: Introduce BPF trampoline · fec56f58
      Alexei Starovoitov authored
      Introduce the BPF trampoline concept to allow kernel code to call into
      BPF programs with practically zero overhead. The trampoline generation
      logic is architecture dependent. It converts the native calling
      convention into the BPF calling convention. The BPF ISA is 64-bit (even
      on 32-bit architectures). Registers R1 to R5 are used to pass arguments
      into BPF functions. The main BPF program accepts only a single argument
      "ctx" in R1, whereas the CPU's native calling convention is different:
      x86-64 passes the first 6 arguments in registers and the rest on the
      stack, x86-32 passes the first 3 arguments in registers, sparc64 passes
      the first 6 in registers, and so on.
      
      Trampolines between BPF and the kernel already exist. The BPF_CALL_x
      macros in include/linux/filter.h statically compile trampolines from BPF
      into kernel helpers. They convert up to five u64 arguments into kernel C
      pointers and integers. On 64-bit architectures these BPF_to_kernel
      trampolines are nops. On 32-bit architectures they're meaningful.
      
      The opposite job, kernel_to_BPF trampolines, is done by the CAST_TO_U64
      macros and the __bpf_trace_##call() shim functions in
      include/trace/bpf_probe.h. They convert kernel function arguments into
      an array of u64s that the BPF program consumes via the R1=ctx pointer.
      
      This patch set does the same job as the __bpf_trace_##call() static
      trampolines, but dynamically for any kernel function. There are ~22k
      global kernel functions that are attachable via a nop at function entry.
      The function arguments and types are described in BTF. The job of the
      btf_distill_func_proto() function is to extract useful information from
      BTF into a "function model" that architecture-dependent trampoline
      generators will use to generate assembly code that casts kernel function
      arguments into an array of u64s. For example, the kernel function
      eth_type_trans has two pointers. They will be cast to u64 and stored
      into the stack of the generated trampoline. The pointer to that stack
      space will be passed into the BPF program in R1. On x86-64 such a
      generated trampoline will consume 16 bytes of stack and perform two
      stores of %rdi and %rsi into the stack. The verifier will make sure that
      only two u64s are accessed read-only by the BPF program. The verifier
      will also recognize the precise type of the pointers being accessed and
      will not allow typecasting of the pointer to a different type within the
      BPF program.
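      
      A hedged sketch of what the program side looks like for the
      eth_type_trans example above, written against libbpf's bpf_tracing.h
      conventions (the SEC name, BPF_PROG macro and vmlinux.h are assumptions,
      not part of this kernel patch):
      
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        SEC("fentry/eth_type_trans")
        int BPF_PROG(trace_eth_type_trans, struct sk_buff *skb,
                     struct net_device *dev)
        {
                /* Both arguments arrive as read-only, precisely typed
                 * pointers; casts to unrelated pointer types are rejected. */
                bpf_printk("rx on ifindex %d", dev->ifindex);
                return 0;
        }
        
        char LICENSE[] SEC("license") = "GPL";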
      
      The tracing use case in the datacenter demonstrated that certain key
      kernel functions (like tcp_retransmit_skb) have 2 or more kprobes that
      are always active. Other functions have both a kprobe and a kretprobe.
      So it is essential to keep both kernel code and BPF programs executing
      at maximum speed. Hence the generated BPF trampoline is re-generated
      every time a program is attached or detached, to maintain maximum
      performance.
      
      To avoid the high cost of retpoline, the attached BPF programs are
      called directly. __bpf_prog_enter/exit() are used to support per-program
      execution stats. In the future this logic will be optimized further by
      adding support for bpf_stats_enabled_key inside the generated assembly
      code. The introduction of preemptible and sleepable BPF programs will
      completely remove the need to call __bpf_prog_enter/exit().
      
      Detaching a BPF program from the trampoline should not fail. To avoid
      memory allocation in the detach path, half of the page is used as a
      reserve and flipped after each attach/detach. 2k bytes is enough to call
      40+ BPF programs directly, which is enough for BPF tracing use cases.
      This limit can be increased in the future.
      
      BPF_TRACE_FENTRY programs have access to raw kernel function arguments,
      while BPF_TRACE_FEXIT programs have access to the kernel return value as
      well. Often a kprobe BPF program remembers function arguments in a map
      while the kretprobe fetches the arguments from the map and analyzes them
      together with the return value. BPF_TRACE_FEXIT accelerates this typical
      use case.
      
      Recursion prevention for kprobe BPF programs is done via a per-cpu
      bpf_prog_active counter. In practice that turned out to be a mistake: it
      caused programs to randomly skip execution, and the tracing tools missed
      results they were looking for. Hence the BPF trampoline doesn't provide
      built-in recursion prevention. It's the job of the BPF program itself
      and will be addressed in follow-up patches.
      
      BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
      in the future. For example to remove retpoline cost from XDP programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
  4. 03 Nov 2019 (1 commit)
    • bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann authored
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for
      kernel space addresses /as well as/ user space addresses.
      
      However, plain use of probe_kernel_read() for both cases will always
      attempt to access the kernel address space, given that the access is
      performed under KERNEL_DS, and some archs in fact have overlapping
      address spaces where a kernel pointer and a user pointer would have the
      /same/ address value; accessing application memory via
      bpf_probe_read{,_str}() would therefore read garbage values.
      
      Let's fix the BPF side by making use of the recently added 3d708182
      ("uaccess: Add non-pagefault user-space read functions"). Unfortunately,
      the only way to fix this status quo is to add dedicated
      bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str()
      helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain
      their current behavior.
      
      The two *_user() variants always attempt the access under USER_DS; the
      two *_kernel() variants will return -EFAULT when accessing user memory
      if the underlying architecture has non-overlapping address ranges, also
      avoiding the kernel warning from 00c42373 ("x86-64: add warning for
      non-canonical user access address dereferences").
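      
      A hedged program-side sketch (not from this patch; the attach point and
      libbpf headers are assumptions) showing the split helpers in use:
      
        #include <linux/bpf.h>
        #include <linux/ptrace.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        SEC("kprobe/do_sys_open")
        int trace_open(struct pt_regs *ctx)
        {
                /* The second argument of do_sys_open() is a user-space
                 * pointer, so the _user variant is the right tool here. */
                const char *user_filename = (const char *)PT_REGS_PARM2(ctx);
                char path[64];
        
                bpf_probe_read_user_str(path, sizeof(path), user_filename);
        
                /* Kernel structures would use bpf_probe_read_kernel{,_str}();
                 * the legacy bpf_probe_read() keeps its ambiguous behavior. */
                return 0;
        }
        
        char _license[] SEC("license") = "GPL";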
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
  5. 31 Oct 2019 (1 commit)
    • bpf: Replace prog_raw_tp+btf_id with prog_tracing · f1b9509c
      Alexei Starovoitov authored
      The bpf program type raw_tp together with 'expected_attach_type' was the
      most appropriate API to indicate BTF-enabled raw_tp programs. But during
      development it became apparent that 'expected_attach_type' cannot be
      used and a new 'attach_btf_id' field had to be introduced, which means
      the information is duplicated across two fields, one of which is
      ignored.
      Clean it up by introducing a new program type where both
      'expected_attach_type' and 'attach_btf_id' fields have a specific
      meaning.
      In the future 'expected_attach_type' will be extended
      with other attach points that have similar semantics to raw_tp.
      This patch replaces BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
      prog_type = BPF_PROG_TYPE_TRACING
      expected_attach_type = BPF_TRACE_RAW_TP
      attach_btf_id = btf_id of the raw tracepoint inside the kernel
      Future patches will add
      expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
      where programs have the same input context and the same helpers,
      but different attach points.
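      
      A hedged sketch of how this combination is typically consumed, using
      libbpf's later SEC("tp_btf/...") convention (the section naming,
      BPF_PROG macro and vmlinux.h are assumptions, not part of this patch);
      the loader resolves attach_btf_id from the tracepoint name and loads the
      program as BPF_PROG_TYPE_TRACING with BPF_TRACE_RAW_TP:
      
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        SEC("tp_btf/sched_switch")
        int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
                     struct task_struct *next)
        {
                /* Arguments are BTF-typed rather than raw u64s. */
                bpf_printk("switching to pid %d", next->pid);
                return 0;
        }
        
        char LICENSE[] SEC("license") = "GPL";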
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
  6. 17 Oct 2019 (2 commits)
  7. 07 Oct 2019 (1 commit)
  8. 28 Aug 2019 (1 commit)
  9. 22 Aug 2019 (2 commits)
  10. 21 Aug 2019 (1 commit)
  11. 18 Aug 2019 (1 commit)
  12. 10 Aug 2019 (1 commit)
    • sock: make cookie generation global instead of per netns · cd48bdda
      Daniel Borkmann authored
      Generating and retrieving socket cookies is a useful feature that is
      exposed to BPF for various program types through the
      bpf_get_socket_cookie() helper.
      
      The fact that the cookie counter is per netns is quite a limitation for
      BPF in practice, in particular for programs in the host namespace that
      use socket cookies as part of a map lookup key, since they will cause
      socket cookie collisions, e.g. when attached to BPF cgroup hooks or
      cls_bpf on tc egress in the host namespace handling container traffic
      from veth or ipvlan devices with a peer in a different netns. Change the
      counter to be global instead.
      
      Socket cookie consumers must treat the value as opaque in any case. Not
      every socket must have a cookie generated, and knowledge of the counter
      value itself does not provide much value either way, hence the
      conversion to global is fine.
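      
      A hedged sketch of the kind of consumer described above (the map layout
      and the libbpf BTF-style map definition are assumptions, not from this
      patch): a cgroup egress program keying a per-socket byte counter by
      socket cookie, which only works reliably across netns once the counter
      is global:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 65536);
                __type(key, __u64);      /* socket cookie */
                __type(value, __u64);    /* bytes seen */
        } bytes_per_sock SEC(".maps");
        
        SEC("cgroup_skb/egress")
        int count_egress(struct __sk_buff *skb)
        {
                __u64 cookie = bpf_get_socket_cookie(skb);
                __u64 init = 0, *bytes;
        
                bytes = bpf_map_lookup_elem(&bytes_per_sock, &cookie);
                if (!bytes) {
                        bpf_map_update_elem(&bytes_per_sock, &cookie, &init,
                                            BPF_NOEXIST);
                        bytes = bpf_map_lookup_elem(&bytes_per_sock, &cookie);
                }
                if (bytes)
                        __sync_fetch_and_add(bytes, skb->len);
                return 1;   /* allow the packet */
        }
        
        char _license[] SEC("license") = "GPL";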
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 31 Jul 2019 (1 commit)
  14. 30 Jul 2019 (1 commit)
  15. 26 Jul 2019 (2 commits)
  16. 16 Jul 2019 (1 commit)
  17. 15 Jul 2019 (1 commit)
  18. 08 Jul 2019 (1 commit)
  19. 03 Jul 2019 (3 commits)
  20. 29 Jun 2019 (2 commits)
  21. 28 Jun 2019 (1 commit)
    • bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev authored
      Implement the new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and the
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel, or bypass the kernel completely.
      BPF_CGROUP_GETSOCKOPT can inspect/modify getsockopt arguments that the
      kernel returns.
      Both hooks reuse the existing PTR_TO_PACKET{,_END} infrastructure.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is a
      precedent for working with __user memory from bpf). This might be slow
      to do for each {s,g}etsockopt call, which is why I've added
      __cgroup_bpf_prog_array_is_empty, which exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where the cgroup
      program layout might have changed; this should not be a problem because
      in general there is a race between multiple calls to {s,g}etsockopt and
      the user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: reject with EPERM
      * 1: success, continue with the next BPF program in the cgroup chain
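      
      A hedged sketch of a getsockopt hook under this contract (not one of the
      series' selftests; the TCP_CONGESTION override is purely illustrative):
      
        #include <linux/bpf.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <bpf/bpf_helpers.h>
        
        SEC("cgroup/getsockopt")
        int getsockopt_hook(struct bpf_sockopt *ctx)
        {
                __u8 *optval = ctx->optval;
                __u8 *optval_end = ctx->optval_end;
        
                if (ctx->level != IPPROTO_TCP || ctx->optname != TCP_CONGESTION)
                        return 1;        /* not ours: continue unchanged */
        
                if (optval + 6 > optval_end)
                        return 0;        /* bounds check (PTR_TO_PACKET style); 0 -> EPERM */
        
                /* Report "cubic" regardless of what the kernel returned. */
                __builtin_memcpy(optval, "cubic", 6);
                ctx->optlen = 6;
                ctx->retval = 0;
                return 1;
        }
        
        char _license[] SEC("license") = "GPL";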
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  22. 15 Jun 2019 (3 commits)
  23. 14 Jun 2019 (1 commit)
  24. 11 Jun 2019 (1 commit)
  25. 07 Jun 2019 (1 commit)
    • bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann authored
      The intention of the cgroup bind/connect/sendmsg BPF hooks is to act
      transparently to applications, as also stated in the original motivation
      in 7828f20e ("Merge branch 'bpf-cgroup-bind-connect'"). When recently
      integrating the latter two hooks into Cilium to enable host-based
      load-balancing with Kubernetes, I ran into the issue that pods couldn't
      start up as DNS got broken. Kubernetes typically sets up DNS as a
      service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks
      API is currently insufficient and thus not usable as-is for standard
      applications shipped with most distros. Let's break down the issue we
      ran into with a simple example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently to the application,
      this needs reverse translation on the recvmsg() side. A minimal fix for
      this API is to add similar recvmsg() hooks behind the BPF cgroups static
      key such that the program can track state and replace the current
      sockaddr_in{,6} with the original service IP. From the BPF side, this
      basically means tracking the service tuple plus socket cookie in an LRU
      map from which the reverse NAT can then be retrieved via the map value,
      as one example. Side-note: the BPF cgroups static key should be
      converted to a per-hook static key in the future.
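      
      A hedged sketch of the consumer side described above (not this kernel
      patch itself; the rev_nat map layout and BTF-style definition are
      illustrative, and a sendmsg4 hook is assumed to have populated the map
      keyed by socket cookie):
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        struct svc_addr {
                __be32 addr;
                __be16 port;
        };
        
        struct {
                __uint(type, BPF_MAP_TYPE_LRU_HASH);
                __uint(max_entries, 16384);
                __type(key, __u64);              /* socket cookie */
                __type(value, struct svc_addr);  /* original service IP/port */
        } rev_nat SEC(".maps");
        
        SEC("cgroup/recvmsg4")
        int recvmsg4_rev_nat(struct bpf_sock_addr *ctx)
        {
                __u64 cookie = bpf_get_socket_cookie(ctx);
                struct svc_addr *orig;
        
                orig = bpf_map_lookup_elem(&rev_nat, &cookie);
                if (orig) {
                        /* Rewrite msg_name back to the service address the
                         * application originally sent to. */
                        ctx->user_ip4  = orig->addr;
                        ctx->user_port = orig->port;
                }
                return 1;   /* only return code 1 is allowed here */
        }
        
        char _license[] SEC("license") = "GPL";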
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And at the actual packet level it shows that we're using the backend
      server when talking via the 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have the same semantics as in sendmsg BPF
      programs, we only allow return codes in the [1,1] range. In the sendmsg
      case the program is called if msg->msg_name is present, which can be the
      case in both connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we
      act in a similar way and call into the BPF program whenever a non-NULL
      msg->msg_name was passed, independent of sk->sk_state being
      TCP_ESTABLISHED or not. Note that for the TCP case, msg->msg_name is
      ignored in the regular recvmsg path and is therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  26. 25 May 2019 (2 commits)
    • bpf: introduce new bpf prog load flags "BPF_F_TEST_RND_HI32" · c240eff6
      Jiong Wang authored
      x86_64 and AArch64 are perhaps the two arches that run the bpf testsuite
      most frequently, yet the zero extension insertion pass is not enabled
      for them because of their hardware support.
      
      It is critical to guarantee the correctness of this pass, as it is
      supposed to be enabled by default for a number of other arches, for
      example PowerPC, SPARC, arm, NFP etc. Therefore, it would be very useful
      if there were a way to test this pass on, for example, x86_64.
      
      The test methodology employed by this set is "poisoning" useless bits.
      The high 32 bits of a definition are randomized if they are identified
      as not used by any later insn. Such randomization is only enabled under
      a testing mode which is gated by the new bpf prog load flag
      "BPF_F_TEST_RND_HI32".
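      
      A minimal loader-side sketch (not the patch's own test plumbing) of how
      the flag is requested at load time:
      
        #include <linux/bpf.h>
        
        /* Testing-only: ask the verifier to poison unused hi32 halves so the
         * zero-extension logic can be exercised even on arches whose hardware
         * already zero-extends 32-bit operations. Never set in production. */
        static void enable_hi32_randomization(union bpf_attr *attr)
        {
                attr->prog_flags |= BPF_F_TEST_RND_HI32;
        }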
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement bpf_send_signal() helper · 8b401f9e
      Yonghong Song authored
      This patch tries to solve the following specific use case.
      
      Currently, a bpf program can already collect stack traces through the
      kernel function get_perf_callchain() when certain events happen (e.g.,
      cache miss counter or cpu clock counter overflows). But such stack
      traces are not enough for jitted programs, e.g., hhvm (jitted PHP). To
      get the real stack trace, the JIT engine's internal data structures need
      to be traversed in order to get the real user functions.
      
      The bpf program itself may not be the best place to traverse the JIT
      engine, as the traversing logic could be complex and it is not a stable
      interface either.
      
      Instead, hhvm implements a signal handler, e.g. for SIGALRM, and a set
      of program locations at which it can dump stack traces. When it receives
      a signal, it will dump the stack at the next such program location.
      
      Such a mechanism can be implemented in the following way:
        . a perf ring buffer is created between the bpf program
          and the tracing app.
        . once a particular event happens, the bpf program writes
          to the ring buffer and the tracing app gets notified.
        . the tracing app sends a SIGALRM signal to hhvm.
      
      But this method could have large delays, causing the profiling results
      to be skewed.
      
      This patch implements the bpf_send_signal() helper to send a signal to
      hhvm in real time, resulting in the intended stack traces.
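      
      A hedged sketch of the intended usage (not this patch's selftest; the
      perf_event attach point and the SIGUSR1 choice are illustrative):
      
        #include <linux/bpf.h>
        #include <linux/bpf_perf_event.h>
        #include <bpf/bpf_helpers.h>
        
        #define SIGUSR1 10
        
        SEC("perf_event")
        int on_sample(struct bpf_perf_event_data *ctx)
        {
                /* Deliver the signal to the task the event interrupted, so its
                 * own handler can dump the JIT-level stack right away. */
                bpf_send_signal(SIGUSR1);
                return 0;
        }
        
        char _license[] SEC("license") = "GPL";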
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  27. 13 May 2019 (2 commits)
  28. 28 Apr 2019 (1 commit)
    • bpf: Introduce bpf sk local storage · 6ac99e8f
      Martin KaFai Lau authored
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock with "bpf_sk_fullsock()"
      - get the bpf_tcp_sock with "bpf_tcp_sock()"
      - get the listener sock with "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running contexts,
      
      this patch is another effort to make bpf's network programming
      more intuitive to do (together with memory and performance benefits).
      
      When a bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuple (src/dst ip/port) as the key.
      If multiple bpf progs need to store different sk data, multiple maps
      have to be defined.  Hence, memory is wasted storing the duplicated
      keys (i.e. the 4-tuple here) in each of the bpf maps.
      [ The smallest key could be the sk pointer itself, which requires
        some enhancement in the verifier and is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when the sk is freed.
      Otherwise, the bpf map will become full and unusable quickly.
      The sk-free tracking currently could be done during sk state
      transitions (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined, which usually ends up
      with an over-provisioned map in production.  Even if the map were
      re-sizable, since the sk naturally comes and goes away already, this
      potential re-size operation is arguably redundant if the data can be
      directly connected to the sk itself instead of proxy-ing through a bpf
      map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at sk for bpf prog to use.  The space will be allocated when the first bpf
      prog has created data for this particular sk.
      
      The design optimizes the bpf prog's lookup (optionally followed by an
      inline update).  bpf_spin_lock should be used if the inline update needs
      to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wide
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
      "sk-local-storage" data from a particular sk.
      Think of the map as the meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are mostly:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "types" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[] which is a stable sized array.  Each
      "sk-local-storage" type has a stable offset to the cache[] array.
      In the future, a map's flag could be introduced to do cache
      opt-out/enforcement if it became necessary.
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share a map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-keyed map usage
      to minimize the map lookup penalty.  16 has enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helpers bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  bpf_sk_storage_get() also allows "creating"
      a new elem if one does not exist in the sk.  This is done with
      the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value with BPF_SK_STORAGE_GET_F_CREATE.
      BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      this has eliminated the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
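      
      A hedged sketch close to the benchmark workload described further below
      (not the exact benchmark program; the libbpf BTF-style map definition is
      an assumption): a per-socket egress byte counter kept in sk local
      storage:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        struct {
                __uint(type, BPF_MAP_TYPE_SK_STORAGE);
                __uint(map_flags, BPF_F_NO_PREALLOC);   /* required here */
                __type(key, int);
                __type(value, __u64);
        } sk_bytes SEC(".maps");
        
        SEC("cgroup_skb/egress")
        int egress_count(struct __sk_buff *skb)
        {
                struct bpf_sock *sk = skb->sk;
                __u64 *bytes;
        
                if (!sk)
                        return 1;
                sk = bpf_sk_fullsock(sk);
                if (!sk)
                        return 1;
        
                /* Create the per-sk elem on first use; no explicit insert. */
                bytes = bpf_sk_storage_get(&sk_bytes, sk, 0,
                                           BPF_SK_STORAGE_GET_F_CREATE);
                if (bytes)
                        *bytes += skb->len;
                return 1;
        }
        
        char _license[] SEC("license") = "GPL";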
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective,  the map has the socket fd as the key while the map
         can be shared by pinned-file or map-id.
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel defined btf with 4 tuples as the return key could
         be explored later also.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations are used instead.
         e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for the details in
         synchronization cases and considerations.
      
      3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key. (The verifier is modified to support the sk ptr as
      the key, which should have shortened the key lookup time.)
      
      Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb and then bump the cnt.  netperf is used to drive
      data with 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
      
      Here is a high-level picture of how the objects are organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage───── bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └──────────│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage───────bpf_sk_storage
        └──────┘
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  29. 27 Apr 2019 (1 commit)
    • bpf: add writable context for raw tracepoints · 9df1c28b
      Matt Mullins authored
      This is an opt-in interface that allows a tracepoint to provide a safe
      buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
      The size of the buffer must be a compile-time constant, and is checked
      before allowing a BPF program to attach to a tracepoint that uses this
      feature.
      
      The pointer to this buffer will be the first argument of tracepoints
      that opt in; the pointer is valid and can be bpf_probe_read() by both
      BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
      programs that attach to such a tracepoint, but the buffer to which it
      points may only be written by the latter.
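      
      A hedged sketch assuming a hypothetical writable tracepoint "foo_event"
      whose first argument is a buffer of struct foo_ctx (the tracepoint, its
      layout, and the SEC naming are illustrative; this patch only adds the
      mechanism):
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        struct foo_ctx {           /* illustrative, not a real kernel struct */
                int verdict;
        };
        
        /* Loaded as BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE. */
        SEC("raw_tp.w/foo_event")
        int mark_event(struct bpf_raw_tracepoint_args *ctx)
        {
                /* args[0] is the tracepoint-provided writable buffer. */
                struct foo_ctx *fc = (struct foo_ctx *)ctx->args[0];
        
                fc->verdict = 1;   /* only _WRITABLE programs may store here */
                return 0;
        }
        
        char _license[] SEC("license") = "GPL";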
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>