1. 20 April 2018, 5 commits
    • bpf: btf: Add BPF_BTF_LOAD command · f56a653c
      Committed by Martin KaFai Lau
      This patch adds a BPF_BTF_LOAD command which
      1) loads and verifies the BTF (implemented in earlier patches)
      2) returns a BTF fd to userspace.  In the next patch, the
         BTF fd can be specified during BPF_MAP_CREATE.
      
      It is currently limited to CAP_SYS_ADMIN.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f56a653c
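      A minimal user-space sketch of issuing the new command (assuming the
      btf, btf_size and btf_log_* fields of union bpf_attr from the uapi
      header; error handling trimmed):

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int btf_load(const void *btf_data, __u32 btf_size,
                            char *log, __u32 log_size)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.btf = (__u64)(unsigned long)btf_data;
                attr.btf_size = btf_size;
                attr.btf_log_buf = (__u64)(unsigned long)log;
                attr.btf_log_size = log_size;
                attr.btf_log_level = 1;

                /* returns a BTF fd on success, negative error otherwise */
                return syscall(__NR_bpf, BPF_BTF_LOAD, &attr, sizeof(attr));
        }

      The returned fd is what the next patch allows to be passed to
      BPF_MAP_CREATE.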
    • bpf: btf: Add pretty print capability for data with BTF type info · b00b8dae
      Committed by Martin KaFai Lau
      This patch adds pretty print capability for data with BTF type info.
      The current usage is to allow pretty print for a BPF map.
      
      The next few patches will allow a read() on a pinned map with BTF
      type info for its key and value.
      
      This patch uses the seq_printf() infra.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b00b8dae
    • bpf: btf: Check members of struct/union · 179cde8c
      Committed by Martin KaFai Lau
      This patch checks a few things about a struct's members:
      
      1) It has a valid size (e.g. a "const void" is invalid)
      2) A member's size (plus its offset) does not exceed
         the containing struct's size.
      3) The member's offset satisfies the alignment requirement
      
      The above can only be done after a needs_resolve member's type has
      been resolved.  Hence, these checks are done together in
      btf_struct_resolve().
      
      Each possible member type (e.g. int, enum, modifier...) implements a
      check_member() op, which is called from btf_struct_resolve().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      179cde8c
    • bpf: btf: Validate type reference · eb3f595d
      Committed by Martin KaFai Lau
      After collecting all btf_type objects in the first pass (in an earlier
      patch), the second pass (in this patch) can validate the reference
      types (e.g. the referred-to type exists and a type does not refer to
      itself).
      
      While checking the reference types, it also gathers other information
      (e.g. the size of an array).  This info will be useful when checking
      a struct's members in a later patch.  It will also be useful for
      pretty printing later.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      eb3f595d
    • bpf: btf: Introduce BPF Type Format (BTF) · 69b693f0
      Committed by Martin KaFai Lau
      This patch introduces the BPF Type Format (BTF).
      
      BTF (BPF Type Format) is the metadata format that describes
      the data types of BPF programs/maps.  Hence, it basically focuses
      on the C programming language, which modern BPF primarily
      uses.  The first use case is to provide a generic pretty print
      capability for a BPF map.
      
      BTF has its roots in CTF (Compact C Type Format).  To simplify
      the handling of BTF data, BTF removes the distinction between
      small and big types/struct-members.  Hence, BTF consistently uses u32
      instead of supporting both "one u16" and "two u32 (+ padding)" forms
      for describing a type or struct member.
      
      It also raises the number of types (and functions) limit
      from 0x7fff to 0x7fffffff.
      
      Due to the above changes, the format is not compatible with CTF.
      Hence, BTF starts with a new BTF_MAGIC and version number.
      
      This patch does the first verification pass on the BTF.  The first
      pass checks:
      1. meta-data size (e.g. it does not go beyond the total BTF size)
      2. name_offset is valid
      3. Each BTF_KIND (e.g. int, enum, struct....) does its
         own check of its meta-data.
      
      Some other checks, like checking that a struct member refers
      to a valid type, can only be done in the second pass.  The second
      verification pass will be implemented in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      69b693f0
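      For reference, the fixed-size u32-based descriptor described above
      corresponds to the uapi struct btf_type; the sketch below abbreviates
      the comments but keeps the field layout:

        struct btf_type {
                __u32 name_off;   /* offset into the BTF string section */
                __u32 info;       /* vlen in the low bits, BTF_KIND_* in the high bits */
                union {
                        __u32 size;   /* used by INT, ENUM, STRUCT, UNION */
                        __u32 type;   /* a type reference, used by PTR, TYPEDEF, CONST, ... */
                };
        };

      Kind-specific data (e.g. struct members or array descriptors) follows
      each such descriptor.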
  2. 17 April 2018, 3 commits
    • xdp: transition into using xdp_frame for return API · 03993094
      Committed by Jesper Dangaard Brouer
      Changing the xdp_return_frame() API to take struct xdp_frame as its
      argument seems like a natural choice, but there are some subtle
      performance details here that need extra care; the choices below are
      deliberate.

      Dereferencing the xdp_frame on a remote CPU during DMA-TX completion
      causes its cache line to change to the "Shared" state.  Later, when
      the page is reused for RX, this xdp_frame cache line is written,
      which changes the state to "Modified".
      
      This situation already happens (naturally) for virtio_net, tun and
      cpumap, as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache lines
      (with pointers) between CPUs.  Thus, the only option is to
      dereference the xdp_frame.
      
      Only the ixgbe driver had an optimization by which it could
      avoid dereferencing the xdp_frame.  The driver already has a
      TX-ring queue, which (in case of remote DMA-TX completion) has to be
      transferred between CPUs anyhow.  In this data area, we stored a
      struct xdp_mem_info and a data pointer, which allowed us to avoid
      dereferencing the xdp_frame.

      To compensate for this, a prefetchw is used to tell the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate for the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03993094
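      In terms of the API shape, the change boils down to the following
      (a sketch of the before/after prototypes, not the exact header):

        /* before: caller had to carry the data pointer and mem info separately */
        void xdp_return_frame(void *data, struct xdp_mem_info *mem);

        /* after: the frame itself carries its xdp_mem_info */
        void xdp_return_frame(struct xdp_frame *xdpf);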
    • bpf: cpumap convert to use generic xdp_frame · 70280ed9
      Committed by Jesper Dangaard Brouer
      The generic xdp_frame format was inspired by cpumap's own internal
      xdp_pkt format.  It is now time to convert cpumap over to the generic
      xdp_frame format.  The cpumap needs one extra field, dev_rx.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70280ed9
    • xdp: introduce xdp_return_frame API and use in cpumap · 5ab073ff
      Committed by Jesper Dangaard Brouer
      Introduce an xdp_return_frame API, and convert cpumap over as
      the first user, given it has a queued XDP frame structure to leverage.
      
      V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
      V6: Remove comment that id will be added later (Req by Alex Duyck)
      V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ab073ff
  3. 04 April 2018, 3 commits
  4. 31 March 2018, 5 commits
    • bpf: Post-hooks for sys_bind · aac3fc32
      Committed by Andrey Ignatov
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair in some map
      for use in later calls to sys_connect.  E.g. if some TCP server
      inside a cgroup was bound to some IP:port_n, it can be recorded in a
      map.  Later, when some TCP client inside the same cgroup tries to
      connect to 127.0.0.1:port_n, the BPF hook for sys_connect can
      override the destination and connect the application to IP:port_n
      instead of 127.0.0.1:port_n.  That helps force all applications
      inside a cgroup to use the desired IP without breaking applications
      that e.g. use localhost to communicate with each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid
      accessing the IPv6 fields of `struct sock` from `inet_bind()` and the
      IPv4 fields from `inet6_bind()`, since those fields might not make
      sense in such cases.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      aac3fc32
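      A minimal sketch of such a post-bind program (the section name, the
      allowed port and the 1 = allow / 0 = reject-with-EPERM return
      convention follow the usual cgroup-sock usage; src_port is assumed to
      be in host byte order here):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/post_bind4")
        int enforce_port_policy(struct bpf_sock *ctx)
        {
                /* src_ip4/src_port reflect what the kernel just bound */
                if (ctx->src_port != 8080)
                        return 0;       /* force sys_bind to fail with EPERM */
                return 1;               /* allow */
        }

        char _license[] SEC("license") = "GPL";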
    • bpf: Hooks for sys_connect · d74bad4e
      Committed by Andrey Ignatov
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the
      2nd part of the problem: making outgoing connections from a desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      The local end of a connection can be bound to a desired IP using the
      newly introduced BPF helper `bpf_bind()`.  It only allows binding to
      an IP though, and doesn't support binding to a port, i.e. it
      leverages the `IP_BIND_ADDRESS_NO_PORT` socket option.  There are two
      reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use case for binding to a port.
      
      As for the remote end (the `struct sockaddr *` passed by the user),
      both parts of it can be overridden: remote IP and remote port.  It's
      useful if an application inside a cgroup wants to connect to another
      application inside the same cgroup, or to itself, but knows nothing
      about the IP assigned to the cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for the same reason as the
      sys_bind hooks, i.e. to prevent reading from / writing to e.g. the
      user_ip6 fields when the user passes a sockaddr_in, since that would
      be out of bounds.
      
      == Implementation notes ==
      
      The patch introduces a new field in `struct proto`: `pre_connect`, a
      pointer to a function with the same signature as `connect` that is
      called before it.  The reason is that in some cases BPF hooks should
      be called well before control is passed to `sk->sk_prot->connect`.
      Specifically, `inet_dgram_connect` autobinds the socket before
      calling `sk->sk_prot->connect`, and there is no way to call
      `bpf_bind()` from hooks in e.g. `ip4_datagram_connect` or
      `ip6_datagram_connect` since it'd cause a double bind.  On the other
      hand, `proto.pre_connect` provides a flexible way to add BPF hooks
      for connect only for the necessary `proto` and to call them at the
      desired time before `connect`.  Since `bpf_bind()` is allowed to bind
      only to an IP, and the autobind in `inet_dgram_connect` binds only a
      port, there is no chance of a double bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind only to an IP,
      regardless of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling __inet_bind()
      and __inet6_bind(), since all call sites where bpf_bind() is called
      already hold the socket lock.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d74bad4e
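      A rough sketch of a connect4 hook that pins the source IP via
      bpf_bind() (the section name, the 10.0.0.1 address and the decision
      not to rewrite the destination are illustrative; ctx fields are in
      network byte order):

        #include <sys/socket.h>
        #include <linux/bpf.h>
        #include <linux/in.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        SEC("cgroup/connect4")
        int bind_src_ip(struct bpf_sock_addr *ctx)
        {
                struct sockaddr_in sa = {
                        .sin_family = AF_INET,
                        .sin_addr.s_addr = bpf_htonl(0x0a000001), /* 10.0.0.1 */
                };

                /* bind the local end of the upcoming connection to the
                 * desired source IP; the port is left to the kernel
                 * (IP_BIND_ADDRESS_NO_PORT semantics)
                 */
                if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
                        return 0;       /* reject the connect(2) */

                /* the destination (ctx->user_ip4 / ctx->user_port) could be
                 * rewritten here as well
                 */
                return 1;
        }

        char _license[] SEC("license") = "GPL";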
    • bpf: Hooks for sys_bind · 4fbac77d
      Committed by Andrey Ignatov
      == The problem ==
      
      There is a use case where all processes inside a cgroup should use
      one single IP address on a host that has multiple IPs configured.
      Those processes should use the IP for both ingress and egress, for
      TCP and UDP traffic.  So TCP/UDP servers should be bound to that IP
      to accept incoming connections on it, and TCP/UDP clients should make
      outgoing connections from that IP.  It should not require changing
      application code, since that's often not possible.
      
      Currently it's solved by intercepting the glibc wrappers around
      syscalls such as `bind(2)` and `connect(2)`.  It's done by a shared
      library that is preloaded for every process in a cgroup so that
      whenever a TCP/UDP server calls `bind(2)`, the library replaces the
      IP in the sockaddr before passing the arguments to the syscall.  When
      an application calls `connect(2)`, the library transparently binds
      the local end of the connection to that IP (`bind(2)` with
      `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).
      
      The shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the
      1st part of the problem: binding TCP/UDP servers to a desired IP.  It
      does not depend on the application environment or implementation
      details (whether glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct
      sock`) in a cgroup and the user-provided `struct sockaddr`.  Pointers
      to both of them are part of the context passed to programs of the
      newly added type.

      The new attach types provide hooks in the `bind(2)` system call for
      both IPv4 and IPv6 so that one can write a program to override the IP
      addresses and ports a user program tries to bind to, and apply such a
      program to a whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      Write access to `struct bpf_sock_addr_kern` is implemented using a
      special field as an additional "register".

      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with the value to write, and `dst` with a pointer to the context that
      can't be changed without breaking later instructions.  But the fields
      that may be written to are not available directly; to access them,
      the address of the corresponding pointer has to be loaded first.  To
      get an additional register, the first one not used by `src` and `dst`
      is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, the
      register is then used to load the address of the pointer field, and
      finally the register's content is restored from the temporary field
      after the `src` value has been written.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4fbac77d
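      A minimal sketch of a bind4 program that pins servers in the cgroup
      to one IP (the section name and the 10.0.0.1 address are
      illustrative; ctx fields are in network byte order):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        SEC("cgroup/bind4")
        int rewrite_bind(struct bpf_sock_addr *ctx)
        {
                /* whatever address the application asked for, pin the
                 * server to the cgroup's IP; ctx->user_port could be
                 * rewritten the same way
                 */
                ctx->user_ip4 = bpf_htonl(0x0a000001);  /* 10.0.0.1 */

                return 1;       /* 1 = let bind(2) proceed, 0 = reject with EPERM */
        }

        char _license[] SEC("license") = "GPL";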
    • bpf: Check attach type at prog load time · 5e43f899
      Committed by Andrey Ignatov
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. a context structure may have fields for both IPv4 and IPv6, but
      it doesn't make sense to read from / write to an IPv6 field when the
      attach point is somewhere in the IPv4 stack.

      The same applies to BPF helpers: it may make sense to call some
      helper from one attach point, but not from another, for the same prog
      type.
      
      == The solution ==
      
      Introduce an `expected_attach_type` field in `union bpf_attr` for the
      `BPF_PROG_LOAD` command.  If the scenario described in the "The
      problem" section applies to some prog type, the field is checked
      twice:
      
      1) At load time, the prog type is checked to see if the attach type
         for it must be known to validate program permissions correctly.
         The prog will be rejected with EINVAL if that's the case and
         `expected_attach_type` is not specified or has an invalid value.

      2) At attach time, `attach_type` is compared with
         `expected_attach_type`, if the prog type requires one, and, if
         they differ, the attach will be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct
      bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()`, and can be used to check
      context accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com>
      and Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5e43f899
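      From user space, this simply means one more field is filled in at
      load time; a sketch of the raw-syscall path, using the sock_addr
      prog/attach types from the patches above as the example (error
      handling and map setup trimmed):

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int load_bind4_prog(const struct bpf_insn *insns, __u32 insn_cnt)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.prog_type            = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
                /* must be set at load time and match the attach_type used
                 * later in BPF_PROG_ATTACH, otherwise the kernel returns EINVAL
                 */
                attr.expected_attach_type = BPF_CGROUP_INET4_BIND;
                attr.insns                = (__u64)(unsigned long)insns;
                attr.insn_cnt             = insn_cnt;
                attr.license              = (__u64)(unsigned long)"GPL";

                return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        }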
    • bpf: sockmap: initialize sg table entries properly · 6ef6d84c
      Committed by Prashant Bhole
      When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in
      sg_init_table() and verified by the sg API while navigating.  We hit
      a BUG_ON when the magic check fails.

      In bpf_tcp_sendpage and bpf_tcp_sendmsg, the struct containing
      the scatterlist is already zeroed out.  So to avoid an extra memset,
      we use sg_init_marker() to initialize sg_magic.
      
      Fixed the following things:
      - In bpf_tcp_sendpage: initialize sg using sg_init_marker
      - In bpf_tcp_sendmsg: Replace sg_init_table with sg_init_marker
      - In bpf_tcp_push: Replace memset with sg_init_table where consumed
        sg entry needs to be re-initialized.
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6ef6d84c
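      The two initializers differ only in the memset; a sketch of the idea
      (the helper name and comments are descriptive, not the actual sockmap
      code):

        #include <linux/scatterlist.h>

        static void init_already_zeroed_sg(struct scatterlist *sg,
                                           unsigned int nents)
        {
                /* sg_init_table() would memset() the array first and then
                 * set sg_magic (under CONFIG_DEBUG_SG) plus the end marker.
                 * When the array is known to be zeroed already, as in
                 * bpf_tcp_sendmsg()/bpf_tcp_sendpage(), the memset is
                 * redundant and sg_init_marker() is enough.
                 */
                sg_init_marker(sg, nents);
        }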
  5. 30 March 2018, 2 commits
  6. 29 March 2018, 1 commit
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Committed by Alexei Starovoitov
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      From the BPF program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      The kprobe+bpf infrastructure allows programs to access function
      arguments.  This feature allows programs to access raw tracepoint
      arguments.

      Similar to the proposed 'dynamic ftrace events', there are no ABI
      guarantees about what the tracepoint arguments are and what their
      meaning is.  The program needs to type-cast the args properly and use
      the bpf_probe_read() helper to access struct fields when an argument
      is a pointer.
      
      For every tracepoint, a __bpf_trace_##call function is prepared.
      In assembly it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembly snippet casts the 32-bit 'act' field into a 'u64'
      to pass into bpf_trace_run3(), while the 'dev' and 'xdp' args are
      passed as-is.  All ~500 __bpf_trace_*() functions are only 5-10 bytes
      long, and in total this approach adds 7k bytes to .text.

      This approach gives the lowest possible overhead
      when calling trace_xdp_exception() from kernel C code and
      transitioning into BPF land.  Since tracepoint+bpf is used at speeds
      of 1M+ events per second, this is a valuable optimization.
      
      A new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns an anon_inode FD of a 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of the tracing daemon or cmdline tool that uses this feature
      will automatically detach the bpf program, unload it and
      unregister the tracepoint probe.
      
      On the kernel side, the __bpf_raw_tp_map section of pointers to the
      tracepoint definition and to the __bpf_trace_*() probe function is
      used to find the tracepoint named "xdp_exception" and the
      corresponding __bpf_trace_xdp_exception() probe function, which are
      passed to tracepoint_probe_register() to connect the probe with the
      tracepoint.
      
      The addition of bpf_raw_tracepoint doesn't interfere with the ftrace
      and perf tracepoint mechanisms.  perf_event_open() can be used in
      parallel on the same tracepoint.  Multiple
      bpf_raw_tracepoint_open("xdp_exception", prog_fd) calls are
      permitted, each with its own bpf program.  The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      The __bpf_raw_tp_map section logic was contributed by Steven Rostedt.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c4f6699d
  7. 28 March 2018, 1 commit
  8. 26 March 2018, 2 commits
  9. 24 March 2018, 1 commit
    • bpf: Remove struct bpf_verifier_env argument from print_bpf_insn · abe08840
      Committed by Jiri Olsa
      We use print_bpf_insn in user space (bpftool, and soon perf),
      so it'd be nice to keep it generic and strip the kernel
      struct bpf_verifier_env argument from it.
      
      This argument can be safely removed, because its users can
      use struct bpf_insn_cbs::private_data to pass it.
      
      By changing the argument type, we can no longer have a clean
      'verbose' alias to 'bpf_verifier_log_write' in verifier.c.
      Instead, we add a 'verbose' cb_print callback and
      remove the alias.
      
      This way we have the new cb_print callback in place, and all
      the 'verbose(env, ...)' calls in verifier.c will cleanly
      cast to 'verbose(void *, ...)', so no other change is
      needed.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      abe08840
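      A rough sketch of what a user-space caller provides after this change
      (the callback shape follows the bpf_insn_cbs/print_bpf_insn interface
      in kernel/bpf/disasm.h; the print function itself and the FILE *
      private data are illustrative, and 'insn' is assumed to point at an
      instruction obtained elsewhere):

        #include <stdio.h>
        #include <stdarg.h>

        /* printk-style callback; private_data is handed back as the first arg */
        static void print_insn(void *private_data, const char *fmt, ...)
        {
                va_list args;

                va_start(args, fmt);
                vfprintf(private_data, fmt, args);
                va_end(args);
        }

        const struct bpf_insn_cbs cbs = {
                .cb_print     = print_insn,
                .private_data = stderr,
        };

        print_bpf_insn(&cbs, insn, true /* allow_ptr_leaks */);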
  10. 21 March 2018, 1 commit
  11. 20 March 2018, 2 commits
    • bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      Committed by John Fastabend
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usage, when a sock is added to a
      sockmap via a map update, if the map has a BPF_SK_MSG_VERDICT
      program attached, then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in the sendmsg case and every page/offset in the sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP.  Returning SK_DROP frees the copied data in the sendmsg
      case and leaves the data untouched in the sendpage case.  Both cases
      return -EACCES to the user.  Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on.  This is used similarly to existing redirect use
      cases.  This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • sockmap: convert refcnt to an atomic refcnt · ffa35660
      Committed by John Fastabend
      The sockmap refcnt has, up until now, been wrapped in the
      sk_callback_lock(), so it has not actually needed any locking of its
      own.  The counter itself tracks the lifetime of the psock object.
      Sockets in a sockmap have a lifetime that is independent of the
      map they are part of.  This is possible because a single socket may
      be in multiple maps.  When this happens we can only release the
      psock data associated with the socket when the refcnt reaches
      zero.  There are three possible paths that decrement the sock
      reference: first, the normal sockmap path, where the user deletes
      the socket from the map.  Second, the map is removed and all sockets
      in the map are removed; the delete path is similar to case 1.  The
      third case is an asynchronous socket event, such as the socket
      closing; this last case handles removing sockets that are no longer
      available.  For completeness, although inc does not pose any problems
      in this patch series, the inc case only happens when a psock is added
      to a map.
      
      Next we plan to add another socket prog type to handle policy and
      monitoring on the TX path.  When we do this, however, we will need to
      keep a reference count open across the sendmsg/sendpage call, and
      holding the sk_callback_lock() there (on every send) seems less than
      ideal; it may also sleep in cases where we hit memory pressure.
      Instead of dealing with these issues in some clever way, simply make
      the reference counting a refcount_t type and do proper atomic ops.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ffa35660
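      The resulting pattern is the standard refcount_t one; roughly (a
      sketch, not the actual sock_map code, and psock_destroy() is an
      illustrative stand-in for the final teardown):

        #include <linux/refcount.h>

        struct smap_psock {
                refcount_t refcnt;
                /* ... */
        };

        /* illustrative: the actual teardown lives elsewhere */
        void psock_destroy(struct smap_psock *psock);

        /* map update / attach paths take a reference */
        static void psock_get(struct smap_psock *psock)
        {
                refcount_inc(&psock->refcnt);
        }

        /* map delete, map teardown and sock close paths drop it; whoever
         * drops the last reference frees the psock
         */
        static void psock_put(struct smap_psock *psock)
        {
                if (refcount_dec_and_test(&psock->refcnt))
                        psock_destroy(psock);
        }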
  12. 15 March 2018, 1 commit
    • bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7
      Committed by Song Liu
      Currently, bpf stackmaps store an address for each entry in the call
      trace.  To map these addresses to user-space files, it is necessary
      to maintain the mapping from these virtual addresses to symbols in
      the binary.  Usually, the user-space profiler (such as perf) has to
      scan /proc/pid/maps at the beginning of profiling, and monitor
      mmap2() calls afterwards.  Given the cost of maintaining the address
      map, this solution is not practical for system-wide profiling that is
      always on.
      
      This patch tries to solve this problem with a variation of stackmap. This
      variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
      addresses, the variation stores ELF file build_id + offset.
      
      Build ID is a 20-byte unique identifier for ELF files. The following
      command shows the Build ID of /bin/bash:
      
        [user@]$ readelf -n /bin/bash
        ...
          Build ID: XXXXXXXXXX
        ...
      
      With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
      for each entry in the call trace, and translate it into the following
      struct:
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
      The search for the build_id is limited to the first page of the file,
      and this page should be in the page cache.  Otherwise, we fall back
      to storing the ip for this entry (the ip field in struct
      bpf_stack_build_id_offset).  This requires the build_id to be stored
      in the first page.  A quick survey of binary and dynamic library
      files on a few different systems shows that almost all binary and
      dynamic library files have the build_id in the first page.
      
      The build_id is only meaningful for a user stack.  If a kernel stack
      is added to a stackmap with BPF_F_STACK_BUILD_ID, it will
      automatically fall back to storing only the ip (status ==
      BPF_STACK_BUILD_ID_IP).  Similarly, if the build_id lookup fails for
      some reason, it will also fall back to storing the ip.
      
      User space can access struct bpf_stack_build_id_offset with bpf
      syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
      maintain mapping from build id to binary files. This mostly static
      mapping is much easier to maintain than per process address maps.
      
      Note: Stackmap with build_id only works in non-nmi context at this time.
      This is because we need to take mm->mmap_sem for find_vma(). If this
      changes, we would like to allow build_id lookup in nmi context.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      615755a7
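      A sketch of creating such a stackmap from user space via the raw
      syscall (the max depth and max_entries values are illustrative, and
      the value-struct name is taken from the description above; the name
      in the uapi header may differ):

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        #define MAX_STACK_DEPTH 127

        static int create_build_id_stackmap(void)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.map_type    = BPF_MAP_TYPE_STACK_TRACE;
                attr.key_size    = sizeof(__u32);
                /* one build_id record per frame instead of one __u64 ip */
                attr.value_size  = sizeof(struct bpf_stack_build_id_offset) *
                                   MAX_STACK_DEPTH;
                attr.max_entries = 10000;
                attr.map_flags   = BPF_F_STACK_BUILD_ID;

                return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        }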
  13. 09 March 2018, 1 commit
    • bpf: comment why dots in filenames under BPF virtual FS are not allowed · 6d8cb045
      Committed by Quentin Monnet
      When pinning a file under the BPF virtual file system (traditionally
      /sys/fs/bpf), using a dot in the name of the location to pin at is not
      allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
      rejected with -EPERM.
      
      This check was introduced at the same time as the BPF file system
      itself, with commit b2197755 ("bpf: add support for persistent
      maps/progs"). At this time, it was checked in a function called
      "bpf_dname_reserved()", which made clear that using a dot was reserved
      for future extensions.
      
      This function disappeared and the check was moved elsewhere with commit
      0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the
      meaning of the dot ban was lost.
      
      The present commit simply adds a comment in the source to explain to
      the reader that the use of dots is reserved for future extensions.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6d8cb045
  14. 24 February 2018, 1 commit
    • bpf: allow xadd only on aligned memory · ca369602
      Committed by Daniel Borkmann
      The requirements around atomic_add() / atomic64_add() resp. their
      JIT implementations differ across architectures.  E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers the following crash via both the interpreter and the JIT:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      The reason for this is that the memory is required to be aligned.  In
      the case of BPF, we always enforce alignment for stack accesses, but
      not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.

      xadd on packet data, which is local to us anyway, is just wrong, so
      forbid this case entirely.  The only place where xadd makes sense, in
      fact, is map values; xadd on the stack is wrong as well, but it's been
      around for much longer.  Specifically enforce strict alignment in the
      case of xadd, so that we handle this case generically and avoid such
      crashes in the first place.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ca369602
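      For reference, the pattern that remains valid after this change is
      xadd on a naturally aligned map value; from BPF C this is what
      __sync_fetch_and_add() typically compiles to, e.g. (a sketch; the
      classic bpf_map_def map definition style of the era is assumed):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct bpf_map_def SEC("maps") counters = {
                .type        = BPF_MAP_TYPE_ARRAY,
                .key_size    = sizeof(__u32),
                .value_size  = sizeof(__u64),
                .max_entries = 1,
        };

        SEC("xdp")
        int count_packets(struct xdp_md *ctx)
        {
                __u32 key = 0;
                __u64 *val = bpf_map_lookup_elem(&counters, &key);

                if (val)
                        /* emitted as BPF_XADD on the (aligned) map value */
                        __sync_fetch_and_add(val, 1);

                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";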
  15. 23 February 2018, 2 commits
    • bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      Committed by Yonghong Song
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in the map_free callback
      function.  Unfortunately, it introduced a lockdep warning.  When lockdep
      checking is turned on, running tools/testing/selftests/bpf/test_lpm_map gives:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since the actual trie removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixes the issue.
      
      Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6c5f6102
    • bpf: add schedule points in percpu arrays management · 32fff239
      Committed by Eric Dumazet
      syzbot managed to trigger RCU-detected stalls in
      bpf_array_free_percpu().
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield cpu if
      needed.
      
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      32fff239
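      The fix amounts to yielding inside the per-element loop; a minimal
      sketch of the pattern (close to, but not literally, the
      bpf_array_free_percpu() code):

        static void bpf_array_free_percpu(struct bpf_array *array)
        {
                int i;

                /* freeing a huge number of per-cpu elements can take a
                 * while; we run in process context, so give the scheduler
                 * a chance between elements
                 */
                for (i = 0; i < array->map.max_entries; i++) {
                        free_percpu(array->pptrs[i]);
                        cond_resched();
                }
        }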
  16. 16 February 2018, 1 commit
    • bpf: fix mlock precharge on arraymaps · 9c2d63b8
      Committed by Daniel Borkmann
      syzkaller recently triggered an OOM during percpu map allocation;
      while there is work in progress by Dennis Zhou to add __GFP_NORETRY
      semantics for the percpu allocator under pressure, there also seems
      to be a missing bpf_map_precharge_memlock() check in array map
      allocation.

      Given that today the actual bpf_map_charge_memlock() happens after
      find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
      is there to bail out early, before we go and do the map setup work,
      when we find that we would hit the limits anyway.  Therefore add this
      for the array map as well.
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9c2d63b8
  17. 15 February 2018, 2 commits
  18. 14 February 2018, 3 commits
  19. 06 February 2018, 2 commits
    • bpf: sockmap, fix leaking maps with attached but not detached progs · 3d9e9526
      Committed by John Fastabend
      When a program is attached to a map, we increment the program refcnt
      to ensure that the program is not removed while it is potentially
      being referenced from the sockmap side.  However, if this same program
      also references the map (this is a reasonably common pattern in
      my programs), then the verifier will also increment the map's refcnt.
      This is to ensure the map doesn't get garbage collected while the
      program has a reference to it.

      So we are left in a state where the map holds a refcnt on the
      program, stopping it from being removed and from releasing the map
      refcnt.  And vice versa: the program holds a refcnt on the map,
      stopping the map from releasing its refcnt on the prog.
      
      All this is fine as long as users detach the program while the
      map fd is still around. But, if the user omits this detach command
      we are left with a dangling map we can no longer release.
      
      To resolve this, when the map fd is released, decrement the program
      references and remove any reference from the map to the program.
      This fixes the issue of a possibly dangling map and creates a
      user-side API constraint: the map fd must be held open
      for programs to be attached to a map.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3d9e9526
    • bpf: sockmap, add sock close() hook to remove socks · 1aa12bdf
      Committed by John Fastabend
      The selftests test_maps program was leaving dangling BPF sockmap
      programs around because not all psock elements were removed from
      the map.  The elements in turn hold a reference on the BPF program
      they are attached to, causing BPF programs to stay open even after
      test_maps has completed.
      
      The original intent was that sk_state_change() would be called
      when TCP socks went through the TCP_CLOSE state.  However, because
      socks may be in the SOCK_DEAD state, or the sock may be a listening
      socket, the event is not always triggered.
      
      To resolve this use the ULP infrastructure and register our own
      proto close() handler. This fixes the above case.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1aa12bdf
  20. 03 February 2018, 1 commit
    • bpf: fix bpf_prog_array_copy_to_user() issues · 0911287c
      Committed by Alexei Starovoitov
      1. move copy_to_user out of rcu section to fix the following issue:
      
      ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
      stack backtrace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
       rcu_preempt_sleep_check include/linux/rcupdate.h:301 [inline]
       ___might_sleep+0x385/0x470 kernel/sched/core.c:6079
       __might_sleep+0x95/0x190 kernel/sched/core.c:6067
       __might_fault+0xab/0x1d0 mm/memory.c:4532
       _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
       copy_to_user include/linux/uaccess.h:155 [inline]
       bpf_prog_array_copy_to_user+0x217/0x4d0 kernel/bpf/core.c:1587
       bpf_prog_array_copy_info+0x17b/0x1c0 kernel/bpf/core.c:1685
       perf_event_query_prog_array+0x196/0x280 kernel/trace/bpf_trace.c:877
       _perf_ioctl kernel/events/core.c:4737 [inline]
       perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4757
      
      2. move *prog under rcu, since it's not ok to dereference it afterwards
      
      3. in the rare case of the prog array being swapped between the
         bpf_prog_array_length() and bpf_prog_array_copy_to_user() calls,
         make sure to copy zeros to user space, so the user doesn't walk
         over uninitialized prog_ids while the kernel reported
         uattr->query.prog_cnt > 0
      
      Reported-by: syzbot+7dbcd2d3b85f9b608b23@syzkaller.appspotmail.com
      Fixes: 468e2f64 ("bpf: introduce BPF_PROG_QUERY command")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0911287c
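      The resulting shape of the copy path is the usual two-step pattern:
      snapshot the ids into a kernel buffer under rcu_read_lock(), then
      copy_to_user() only after the RCU section has ended.  A sketch of the
      idea (a fragment, not the exact kernel/bpf/core.c code; progs,
      uprog_ids and cnt come from the caller and snapshot_prog_ids() is an
      illustrative stand-in for walking the RCU-protected array):

        u32 *ids;
        bool full;
        int ret = 0;

        ids = kcalloc(cnt, sizeof(u32), GFP_KERNEL);
        if (!ids)
                return -ENOMEM;

        rcu_read_lock();
        /* slots that are not filled stay zero, so user space never sees
         * uninitialized prog_ids even if the array was swapped meanwhile
         */
        full = snapshot_prog_ids(rcu_dereference(progs), ids, cnt);
        rcu_read_unlock();

        /* copy_to_user() may fault and sleep, so it must run outside the
         * RCU read-side critical section
         */
        if (copy_to_user(uprog_ids, ids, cnt * sizeof(u32)))
                ret = -EFAULT;

        kfree(ids);
        return !ret && full ? -ENOSPC : ret;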