1. 27 4月, 2018 1 次提交
  2. 25 4月, 2018 1 次提交
    • E
      bpf: add helper for getting xfrm states · 12bed760
      Eyal Birger 提交于
      This commit introduces a helper which allows fetching xfrm state
      parameters by eBPF programs attached to TC.
      
      Prototype:
      bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
      
      skb: pointer to skb
      index: the index in the skb xfrm_state secpath array
      xfrm_state: pointer to 'struct bpf_xfrm_state'
      size: size of 'struct bpf_xfrm_state'
      flags: reserved for future extensions
      
      The helper returns 0 on success. Non zero if no xfrm state at the index
      is found - or non exists at all.
      
      struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
      address and the reqid; it can be further extended by adding elements to
      its end - indicating the populated fields by the 'size' argument -
      keeping backwards compatibility.
      
      Typical usage:
      
      struct bpf_xfrm_state x = {};
      bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
      ...
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      12bed760
  3. 20 4月, 2018 2 次提交
    • M
      bpf: btf: Add pretty print support to the basic arraymap · a26ca7c9
      Martin KaFai Lau 提交于
      This patch adds pretty print support to the basic arraymap.
      Support for other bpf maps can be added later.
      
      This patch adds new attrs to the BPF_MAP_CREATE command to allow
      specifying the btf_fd, btf_key_id and btf_value_id.  The
      BPF_MAP_CREATE can then associate the btf to the map if
      the creating map supports BTF.
      
      A BTF supported map needs to implement two new map ops,
      map_seq_show_elem() and map_check_btf().  This patch has
      implemented these new map ops for the basic arraymap.
      
      It also adds file_operations, bpffs_map_fops, to the pinned
      map such that the pinned map can be opened and read.
      After that, the user has an intuitive way to do
      "cat bpffs/pathto/a-pinned-map" instead of getting
      an error.
      
      bpffs_map_fops should not be extended further to support
      other operations.  Other operations (e.g. write/key-lookup...)
      should be realized by the userspace tools (e.g. bpftool) through
      the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
      Follow up patches will allow the userspace to obtain
      the BTF from a map-fd.
      
      Here is a sample output when reading a pinned arraymap
      with the following map's value:
      
      struct map_value {
      	int count_a;
      	int count_b;
      };
      
      cat /sys/fs/bpf/pinned_array_map:
      
      0: {1,2}
      1: {3,4}
      2: {5,6}
      ...
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a26ca7c9
    • M
      bpf: btf: Add BPF_BTF_LOAD command · f56a653c
      Martin KaFai Lau 提交于
      This patch adds a BPF_BTF_LOAD command which
      1) loads and verifies the BTF (implemented in earlier patches)
      2) returns a BTF fd to userspace.  In the next patch, the
         BTF fd can be specified during BPF_MAP_CREATE.
      
      It currently limits to CAP_SYS_ADMIN.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f56a653c
  4. 19 4月, 2018 1 次提交
  5. 31 3月, 2018 4 次提交
    • A
      bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov 提交于
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      aac3fc32
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
    • A
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov 提交于
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provides hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fbac77d
    • A
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov 提交于
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` () and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5e43f899
  6. 29 3月, 2018 1 次提交
    • A
      bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov 提交于
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      >From bpf program point of view the access to the arguments look like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs access function arguments.
      This feature allows programs access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      to what the tracepoints arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
      All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
      and in total this approach adds 7k bytes to .text.
      
      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second
      this is valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
      Each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c4f6699d
  7. 20 3月, 2018 4 次提交
    • J
      bpf: sk_msg program helper bpf_sk_msg_pull_data · 015632bb
      John Fastabend 提交于
      Currently, if a bpf sk msg program is run the program
      can only parse data that the (start,end) pointers already
      consumed. For sendmsg hooks this is likely the first
      scatterlist element. For sendpage this will be the range
      (0,0) because the data is shared with userspace and by
      default we want to avoid allowing userspace to modify
      data while (or after) BPF verdict is being decided.
      
      To support pulling in additional bytes for parsing use
      a new helper bpf_sk_msg_pull(start, end, flags) which
      works similar to cls tc logic. This helper will attempt
      to point the data start pointer at 'start' bytes offest
      into msg and data end pointer at 'end' bytes offset into
      message.
      
      After basic sanity checks to ensure 'start' <= 'end' and
      'end' <= msg_length there are a few cases we need to
      handle.
      
      First the sendmsg hook has already copied the data from
      userspace and has exclusive access to it. Therefor, it
      is not necessesary to copy the data. However, it may
      be required. After finding the scatterlist element with
      'start' offset byte in it there are two cases. One the
      range (start,end) is entirely contained in the sg element
      and is already linear. All that is needed is to update the
      data pointers, no allocate/copy is needed. The other case
      is (start, end) crosses sg element boundaries. In this
      case we allocate a block of size 'end - start' and copy
      the data to linearize it.
      
      Next sendpage hook has not copied any data in initial
      state so that data pointers are (0,0). In this case we
      handle it similar to the above sendmsg case except the
      allocation/copy must always happen. Then when sending
      the data we have possibly three memory regions that
      need to be sent, (0, start - 1), (start, end), and
      (end + 1, msg_length). This is required to ensure any
      writes by the BPF program are correctly transmitted.
      
      Lastly this operation will invalidate any previous
      data checks so BPF programs will have to revalidate
      pointers after making this BPF call.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      015632bb
    • J
      bpf: sockmap, add msg_cork_bytes() helper · 91843d54
      John Fastabend 提交于
      In the case where we need a specific number of bytes before a
      verdict can be assigned, even if the data spans multiple sendmsg
      or sendfile calls. The BPF program may use msg_cork_bytes().
      
      The extreme case is a user can call sendmsg repeatedly with
      1-byte msg segments. Obviously, this is bad for performance but
      is still valid. If the BPF program needs N bytes to validate
      a header it can use msg_cork_bytes to specify N bytes and the
      BPF program will not be called again until N bytes have been
      accumulated. The infrastructure will attempt to coalesce data
      if possible so in many cases (most my use cases at least) the
      data will be in a single scatterlist element with data pointers
      pointing to start/end of the element. However, this is dependent
      on available memory so is not guaranteed. So BPF programs must
      validate data pointer ranges, but this is the case anyways to
      convince the verifier the accesses are valid.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      91843d54
    • J
      bpf: sockmap, add bpf_msg_apply_bytes() helper · 2a100317
      John Fastabend 提交于
      A single sendmsg or sendfile system call can contain multiple logical
      messages that a BPF program may want to read and apply a verdict. But,
      without an apply_bytes helper any verdict on the data applies to all
      bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
      care to read the first N bytes of a msg. If the payload is large say
      MB or even GB setting up and calling the BPF program repeatedly for
      all bytes, even though the verdict is already known, creates
      unnecessary overhead.
      
      To allow BPF programs to control how many bytes a given verdict
      applies to we implement a bpf_msg_apply_bytes() helper. When called
      from within a BPF program this sets a counter, internal to the
      BPF infrastructure, that applies the last verdict to the next N
      bytes. If the N is smaller than the current data being processed
      from a sendmsg/sendfile call, the first N bytes will be sent and
      the BPF program will be re-run with start_data pointing to the N+1
      byte. If N is larger than the current data being processed the
      BPF verdict will be applied to multiple sendmsg/sendfile calls
      until N bytes are consumed.
      
      Note1 if a socket closes with apply_bytes counter non-zero this
      is not a problem because data is not being buffered for N bytes
      and is sent as its received.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2a100317
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend 提交于
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
      SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACESS to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similar to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4f738adb
  8. 15 3月, 2018 1 次提交
    • S
      bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7
      Song Liu 提交于
      Currently, bpf stackmap store address for each entry in the call trace.
      To map these addresses to user space files, it is necessary to maintain
      the mapping from these virtual address to symbols in the binary. Usually,
      the user space profiler (such as perf) has to scan /proc/pid/maps at the
      beginning of profiling, and monitor mmap2() calls afterwards. Given the
      cost of maintaining the address map, this solution is not practical for
      system wide profiling that is always on.
      
      This patch tries to solve this problem with a variation of stackmap. This
      variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
      addresses, the variation stores ELF file build_id + offset.
      
      Build ID is a 20-byte unique identifier for ELF files. The following
      command shows the Build ID of /bin/bash:
      
        [user@]$ readelf -n /bin/bash
        ...
          Build ID: XXXXXXXXXX
        ...
      
      With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
      for each entry in the call trace, and translate it into the following
      struct:
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
      The search of build_id is limited to the first page of the file, and this
      page should be in page cache. Otherwise, we fallback to store ip for this
      entry (ip field in struct bpf_stack_build_id_offset). This requires the
      build_id to be stored in the first page. A quick survey of binary and
      dynamic library files in a few different systems shows that almost all
      binary and dynamic library files have build_id in the first page.
      
      Build_id is only meaningful for user stack. If a kernel stack is added to
      a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to
      only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
      lookup failed for some reason, it will also fallback to store ip.
      
      User space can access struct bpf_stack_build_id_offset with bpf
      syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
      maintain mapping from build id to binary files. This mostly static
      mapping is much easier to maintain than per process address maps.
      
      Note: Stackmap with build_id only works in non-nmi context at this time.
      This is because we need to take mm->mmap_sem for find_vma(). If this
      changes, we would like to allow build_id lookup in nmi context.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      615755a7
  9. 05 3月, 2018 1 次提交
  10. 26 1月, 2018 6 次提交
    • L
      bpf: Add BPF_SOCK_OPS_STATE_CB · d4487491
      Lawrence Brakmo 提交于
      Adds support for calling sock_ops BPF program when there is a TCP state
      change. Two arguments are used; one for the old state and another for
      the new state.
      
      There is a new enum in include/uapi/linux/bpf.h that exports the TCP
      states that prepends BPF_ to the current TCP state names. If it is ever
      necessary to change the internal TCP state values (other than adding
      more to the end), then it will become necessary to convert from the
      internal TCP state value to the BPF value before calling the BPF
      sock_ops function. There are a set of compile checks added in tcp.c
      to detect if the internal and BPF values differ so we can make the
      necessary fixes.
      
      New op: BPF_SOCK_OPS_STATE_CB.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d4487491
    • L
      bpf: Add BPF_SOCK_OPS_RETRANS_CB · a31ad29e
      Lawrence Brakmo 提交于
      Adds support for calling sock_ops BPF program when there is a
      retransmission. Three arguments are used; one for the sequence number,
      another for the number of segments retransmitted, and the last one for
      the return value of tcp_transmit_skb (0 => success).
      Does not include syn-ack retransmissions.
      
      New op: BPF_SOCK_OPS_RETRANS_CB.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      a31ad29e
    • L
      bpf: Add support for reading sk_state and more · 44f0e430
      Lawrence Brakmo 提交于
      Add support for reading many more tcp_sock fields
      
        state,	same as sk->sk_state
        rtt_min	same as sk->rtt_min.s[0].v (current rtt_min)
        snd_ssthresh
        rcv_nxt
        snd_nxt
        snd_una
        mss_cache
        ecn_flags
        rate_delivered
        rate_interval_us
        packets_out
        retrans_out
        total_retrans
        segs_in
        data_segs_in
        segs_out
        data_segs_out
        lost_out
        sacked_out
        sk_txhash
        bytes_received (__u64)
        bytes_acked    (__u64)
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      44f0e430
    • L
      bpf: Add sock_ops RTO callback · f89013f6
      Lawrence Brakmo 提交于
      Adds an optional call to sock_ops BPF program based on whether the
      BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
      The BPF program is passed 2 arguments: icsk_retransmits and whether the
      RTO has expired.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f89013f6
    • L
      bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock · b13d8807
      Lawrence Brakmo 提交于
      Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
      use is to determine if there should be calls to sock_ops bpf program at
      various points in the TCP code. The field is initialized to zero,
      disabling the calls. A sock_ops BPF program can set it, per connection and
      as necessary, when the connection is established.
      
      It also adds support for reading and writting the field within a
      sock_ops BPF program. Reading is done by accessing the field directly.
      However, writing is done through the helper function
      bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
      is trying to set a callback that is not supported in the current kernel
      (i.e. running an older kernel). The helper function returns 0 if it was
      able to set all of the bits set in the argument, a positive number
      containing the bits that could not be set, or -EINVAL if the socket is
      not a full TCP socket.
      
      Examples of where one could call the bpf program:
      
      1) When RTO fires
      2) When a packet is retransmitted
      3) When the connection terminates
      4) When a packet is sent
      5) When a packet is received
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b13d8807
    • L
      bpf: Support passing args to sock_ops bpf function · de525be2
      Lawrence Brakmo 提交于
      Adds support for passing up to 4 arguments to sock_ops bpf functions. It
      reusues the reply union, so the bpf_sock_ops structures are not
      increased in size.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      de525be2
  11. 19 1月, 2018 2 次提交
  12. 15 1月, 2018 1 次提交
    • J
      bpf: offload: add map offload infrastructure · a3884572
      Jakub Kicinski 提交于
      BPF map offload follow similar path to program offload.  At creation
      time users may specify ifindex of the device on which they want to
      create the map.  Map will be validated by the kernel's
      .map_alloc_check callback and device driver will be called for the
      actual allocation.  Map will have an empty set of operations
      associated with it (save for alloc and free callbacks).  The real
      device callbacks are kept in map->offload->dev_ops because they
      have slightly different signatures.  Map operations are called in
      process context so the driver may communicate with HW freely,
      msleep(), wait() etc.
      
      Map alloc and free callbacks are muxed via existing .ndo_bpf, and
      are always called with rtnl lock held.  Maps and programs are
      guaranteed to be destroyed before .ndo_uninit (i.e. before
      unregister_netdev() returns).  Map callbacks are invoked with
      bpf_devs_lock *read* locked, drivers must take care of exclusive
      locking if necessary.
      
      All offload-specific branches are marked with unlikely() (through
      bpf_map_is_dev_bound()), given that branch penalty will be
      negligible compared to IO anyway, and we don't want to penalize
      SW path unnecessarily.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a3884572
  13. 13 1月, 2018 1 次提交
  14. 06 1月, 2018 1 次提交
  15. 31 12月, 2017 1 次提交
  16. 19 12月, 2017 1 次提交
    • Y
      bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog · 06ef0ccb
      Yonghong Song 提交于
      The tools/testing/selftests/bpf test program
      test_dev_cgroup fails with the following error
      when compiled with llvm 6.0. (I did not try
      with earlier versions.)
      
        libbpf: load bpf program failed: Permission denied
        libbpf: -- BEGIN DUMP LOG ---
        libbpf:
        0: (61) r2 = *(u32 *)(r1 +4)
        1: (b7) r0 = 0
        2: (55) if r2 != 0x1 goto pc+8
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv1 R10=fp0
        3: (69) r2 = *(u16 *)(r1 +0)
        invalid bpf_context access off=0 size=2
        ...
      
      The culprit is the following statement in dev_cgroup.c:
        short type = ctx->access_type & 0xFFFF;
      This code is typical as the ctx->access_type is assigned
      as below in kernel/bpf/cgroup.c:
        struct bpf_cgroup_dev_ctx ctx = {
              .access_type = (access << 16) | dev_type,
              .major = major,
              .minor = minor,
        };
      
      The compiler converts it to u16 access while
      the verifier cgroup_dev_is_valid_access rejects
      any non u32 access.
      
      This patch permits the field access_type to be accessible
      with type u16 and u8 as well.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Tested-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      06ef0ccb
  17. 18 12月, 2017 1 次提交
    • A
      bpf: introduce function calls (function boundaries) · cc8b0b92
      Alexei Starovoitov 提交于
      Allow arbitrary function calls from bpf function to another bpf function.
      
      Since the beginning of bpf all bpf programs were represented as a single function
      and program authors were forced to use always_inline for all functions
      in their C code. That was causing llvm to unnecessary inflate the code size
      and forcing developers to move code to header files with little code reuse.
      
      With a bit of additional complexity teach verifier to recognize
      arbitrary function calls from one bpf function to another as long as
      all of functions are presented to the verifier as a single bpf program.
      New program layout:
      r6 = r1    // some code
      ..
      r1 = ..    // arg1
      r2 = ..    // arg2
      call pc+1  // function call pc-relative
      exit
      .. = r1    // access arg1
      .. = r2    // access arg2
      ..
      call pc+20 // second level of function call
      ...
      
      It allows for better optimized code and finally allows to introduce
      the core bpf libraries that can be reused in different projects,
      since programs are no longer limited by single elf file.
      With function calls bpf can be compiled into multiple .o files.
      
      This patch is the first step. It detects programs that contain
      multiple functions and checks that calls between them are valid.
      It splits the sequence of bpf instructions (one program) into a set
      of bpf functions that call each other. Calls to only known
      functions are allowed. In the future the verifier may allow
      calls to unresolved functions and will do dynamic linking.
      This logic supports statically linked bpf functions only.
      
      Such function boundary detection could have been done as part of
      control flow graph building in check_cfg(), but it's cleaner to
      separate function boundary detection vs control flow checks within
      a subprogram (function) into logically indepedent steps.
      Follow up patches may split check_cfg() further, but not check_subprogs().
      
      Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
      These restrictions can be relaxed in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cc8b0b92
  18. 13 12月, 2017 1 次提交
  19. 05 12月, 2017 1 次提交
    • L
      bpf: Add access to snd_cwnd and others in sock_ops · f19397a5
      Lawrence Brakmo 提交于
      Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these
      fields are only valid if the socket associated with the sock_ops program
      call is a full socket, the field is_fullsock is also added to the
      bpf_sock_ops struct. If the socket is not a full socket, reading these
      fields returns 0.
      
      Note that in most cases it will not be necessary to check is_fullsock to
      know if there is a full socket. The context of the call, as specified by
      the 'op' field, can sometimes determine whether there is a full socket.
      
      The struct bpf_sock_ops has the following fields added:
      
        __u32 is_fullsock;      /* Some TCP fields are only valid if
                                 * there is a full socket. If not, the
                                 * fields read as zero.
      			   */
        __u32 snd_cwnd;
        __u32 srtt_us;          /* Averaged RTT << 3 in usecs */
      
      There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add
      read access to more 32 bit tcp_sock fields.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f19397a5
  20. 21 11月, 2017 2 次提交
  21. 11 11月, 2017 2 次提交
  22. 05 11月, 2017 3 次提交
  23. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX license identifier to uapi header files with a license · e2be04c7
      Greg Kroah-Hartman 提交于
      Many user space API headers have licensing information, which is either
      incomplete, badly formatted or just a shorthand for referring to the
      license under which the file is supposed to be.  This makes it hard for
      compliance tools to determine the correct license.
      
      Update these files with an SPDX license identifier.  The identifier was
      chosen based on the license information in the file.
      
      GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
      identifier with the added 'WITH Linux-syscall-note' exception, which is
      the officially assigned exception identifier for the kernel syscall
      exception:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      This exception makes it possible to include GPL headers into non GPL
      code, without confusing license compliance tools.
      
      Headers which have either explicit dual licensing or are just licensed
      under a non GPL license are updated with the corresponding SPDX
      identifier and the GPLv2 with syscall exception identifier.  The format
      is:
              ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
      
      SPDX license identifiers are a legally binding shorthand, which can be
      used instead of the full boiler plate text.  The update does not remove
      existing license information as this has to be done on a case by case
      basis and the copyright holders might have to be consulted. This will
      happen in a separate step.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2be04c7